It's been on my list of things to try. I'll give it a shot. At the moment, I've got the file size reduced to 40% of its previous size. With some other changes this might give us good enough performance, but it would be far from ideal.
Any other thoughts are welcome. I'll run some experiments and then write up a proposal.

On Wed, Nov 4, 2015 at 9:15 AM, Aman Sinha <[email protected]> wrote:

> It would be good to try; however I recall that we encountered a SchemaChangeException when querying the JSON cache file. Parth might have more success once he has simplified the metadata.
>
> Aman
>
> On Wed, Nov 4, 2015 at 8:31 AM, Jacques Nadeau <[email protected]> wrote:
>
> > I've been thinking more about this and I think Aman's suggestion of Parquet files is worth a PoC.
> >
> > What we could do:
> >
> > Run a select * order by partCol1, partCol2, ... , partColN query against the existing large json partition file and create a new Parquet version of the file.
> > Hand write a partition type read against the Parquet APIs using the filter APIs and see what performance looks like.
> >
> > Thoughts?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Oct 30, 2015 at 3:36 PM, Parth Chandra <[email protected]> wrote:
> >
> > > Thanks Steven for the link.
> > > Your suggestion of storing only the single valued columns is a good one. It might be OK to have some of the count* queries run a little slower as reading the cache itself is taking way too long. I'm also looking at squashing the column datatype info as there is a lot of redundancy there.
> > >
> > > On Fri, Oct 30, 2015 at 3:22 PM, Steven Phillips <[email protected]> wrote:
> > >
> > > > My view on storing it in some other format is that, yes, it will probably reduce the size of the file, but if we gzip the json file, it should be pretty compact. As for deserialization cost, other formats would be faster, but not dramatically faster. Certainly not the order of magnitude faster that we really need it to be. The reason we chose JSON was because it is readable and easier to deal with.
> > > >
> > > > As for the old code, I can point you at a branch, but it's probably not very helpful. Unless we want to essentially disable value-based partition pruning when using the cache, the old code will not work.
> > > >
> > > > My recommendation would be to come up with a new version of the format which stores only the name and value of columns which are single-valued for each file or row group. This will allow partition pruning to work, but some count queries may not be as fast any more, because the cache won't have column value counts on a per-rowgroup basis any more.
> > > >
> > > > Anyway, here is the link to the original branch.
> > > >
> > > > https://github.com/StevenMPhillips/drill/tree/meta
> > > >
> > > > On Fri, Oct 30, 2015 at 3:01 PM, Parth Chandra <[email protected]> wrote:
> > > >
> > > > > Hey Jacques, Steven,
> > > > >
> > > > > Do we have a branch somewhere which has the initial prototype code? I'd like to prune the file a bit as it looks like reducing the size of the metadata cache file might yield the best results.
> > > > >
> > > > > Also, did we have a particular reason for going with JSON as opposed to a more compact binary format? Are there any arguments against saving this as a protobuf/BSON/Parquet file?
> > > > >
> > > > > Parth
> > > > >
> > > > > On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <[email protected]> wrote:
> > > > >
> > > > > > My first thought is we've gotten too generous in what we're storing in the Parquet metadata file. Early implementations were very lean and it seems far larger today. For example, early implementations didn't keep statistics and ignored row groups (files, schema and block locations only).
> > > > > > If we need multiple levels of information, we may want to stagger (or normalize) them in the file. Also, we may think about what is the minimum that must be done in planning. We could do the file pruning at execution time rather than single-tracking these things (makes stats harder though).
> > > > > >
> > > > > > I also think we should be cautious around jumping to a conclusion until DRILL-3973 provides more insight.
> > > > > >
> > > > > > In terms of caching, I'd be more inclined to rely on file system caching and make sure serialization/deserialization is as efficient as possible as opposed to implementing an application-level cache. (We already have enough problems managing memory without having to figure out when we should drop a metadata cache :D).
> > > > > >
> > > > > > Aside, I always liked this post for entertainment and the thoughts on virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> > > > > >
> > > > > > --
> > > > > > Jacques Nadeau
> > > > > > CTO and Co-Founder, Dremio
> > > > > >
> > > > > > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <[email protected]> wrote:
> > > > > >
> > > > > > > One more thing, for workloads running queries over subsets of same parquet files, we can consider maintaining an in-memory cache as well. Assuming metadata memory footprint per file is low and parquet files are static, not needing us to invalidate the cache often.
> > > > > > >
> > > > > > > H+
> > > > > > >
> > > > > > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:
> > > > > > >
> > > > > > > > I am not familiar with the contents of metadata stored but if deserialization workload seems to be fitting to any of afterburner's claimed improvement points [1] It could well be worth trying given the claimed gain on throughput is substantial.
> > > > > > > >
> > > > > > > > It could also be a good idea to partition caching over a number of files for better parallelization given number of cache files generated is *significantly* less than number of parquet files. Maintaining global statistics seems an improvement point too.
> > > > > > > >
> > > > > > > > -H+
> > > > > > > >
> > > > > > > > 1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > > > > > >
> > > > > > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Forgot to include the link for Jackson's AfterBurner module:
> > > > > > > > > https://github.com/FasterXML/jackson-module-afterburner
> > > > > > > > >
> > > > > > > > > On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > I was going to file an enhancement JIRA but thought I will discuss here first:
> > > > > > > > > >
> > > > > > > > > > The parquet metadata cache file is a JSON file that contains a subset of the metadata extracted from the parquet files. The cache file can get really large ..
> > > > > > > > > > a few GBs for a few hundred thousand files.
> > > > > > > > > > I have filed a separate JIRA: DRILL-3973 for profiling the various aspects of planning including metadata operations. In the meantime, the timestamps in the drillbit.log output indicate a large chunk of time spent in creating the drill table to begin with, which indicates bottleneck in reading the metadata. (I can provide performance numbers later once we confirm through profiling).
> > > > > > > > > >
> > > > > > > > > > A few thoughts around improvements:
> > > > > > > > > > - The jackson deserialization of the JSON file is very slow.. can this be speeded up ? .. for instance the AfterBurner module of jackson claims to improve performance by 30-40% by avoiding the use of reflection.
> > > > > > > > > > - The cache file read is a single threaded process. If we were directly reading from parquet files, we use a default of 16 threads. What can be done to parallelize the read ?
> > > > > > > > > > - Any operation that can be done one time during the REFRESH METADATA command ? for instance..examining the min/max values to determine single-value for partition column could be eliminated if we do this computation during REFRESH METADATA command and store the summary one time.
> > > > > > > > > >
> > > > > > > > > > - A pertinent question is: should the cache file be stored in a more efficient format such as Parquet instead of JSON ?
> > > > > > > > > >
> > > > > > > > > > Aman
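
For reference, the "store only single-valued columns" idea discussed in this thread can be sketched roughly as below. This is only an illustration, not Drill code: the function names and the dict-of-(min, max) layout are made up for the example, and it assumes a column counts as single-valued when its min and max statistics are equal (null handling is ignored). The summary would be computed once, for example at REFRESH METADATA time, and pruning would then read only the summary.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical layout: one dict per row group, column name -> (min, max).
RowGroupStats = Dict[str, Tuple[Any, Any]]


def single_valued_columns(stats: RowGroupStats) -> Dict[str, Any]:
    """Keep only columns whose min equals max, storing just the value."""
    return {
        name: lo
        for name, (lo, hi) in stats.items()
        if lo is not None and lo == hi
    }


def summarize(row_groups: List[RowGroupStats]) -> List[Dict[str, Any]]:
    """Compute the reduced summary once for every row group."""
    return [single_valued_columns(rg) for rg in row_groups]


def prune(summaries: List[Dict[str, Any]], column: str, value: Any) -> List[int]:
    """Return indices of row groups that may contain column == value.

    A row group is dropped only when the column is known to be
    single-valued there and that value differs from the filter value.
    """
    return [
        i
        for i, summary in enumerate(summaries)
        if column not in summary or summary[column] == value
    ]


row_groups = [
    {"year": (2014, 2014), "amount": (1, 90)},
    {"year": (2015, 2015), "amount": (5, 5)},
]
summaries = summarize(row_groups)
kept = prune(summaries, "year", 2015)
```

Because min/max for multi-valued columns is dropped, such columns can no longer be used for pruning or for per-row-group value counts, which is the trade-off mentioned above for count queries.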
