My first thought is that we've gotten too generous in what we store in the Parquet metadata file. Early implementations were very lean, and it seems far larger today. For example, early implementations kept no statistics and ignored row groups (files, schema and block locations only). If we need multiple levels of information, we may want to stagger (or normalize) them in the file, as in the sketch below. We should also think about the minimum that must be done during planning: we could do the file pruning at execution time rather than front-loading all of this work into planning (though that makes using stats harder).
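As a rough illustration of what staggering/normalizing could look like (a sketch only; these class names are hypothetical, not Drill's actual metadata types):

import java.util.List;
import java.util.Map;

// Level 1: lean summary, always deserialized at planning time.
class MetadataSummary {
  String schemaJson;                  // table schema
  List<FileEntry> files;              // file paths + block locations only
}

class FileEntry {
  String path;
  List<String> blockHosts;            // block locations, for assignment affinity
  long detailOffset;                  // where this file's detail record starts
}

// Level 2: heavy detail, loaded lazily (possibly not until execution).
class FileDetail {
  List<RowGroupMetadata> rowGroups;
}

class RowGroupMetadata {
  long start, length, rowCount;
  Map<String, ColumnStats> columns;   // per-column min/max/null count
}

class ColumnStats {
  Object min, max;
  long nulls;
}

Planning would deserialize only the summary level; the row-group/statistics level would be read lazily, which fits with pruning files at execution time.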
I also think we should be cautious about jumping to a conclusion until DRILL-3973 provides more insight. In terms of caching, I'd be more inclined to rely on file system caching and make sure serialization/deserialization is as efficient as possible, as opposed to implementing an application-level cache. (We already have enough problems managing memory without having to figure out when we should drop a metadata cache :D). As an aside, I always liked this post for entertainment and the thoughts on virtual memory:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <[email protected]> wrote:

> One more thing: for workloads running queries over subsets of the same
> parquet files, we can consider maintaining an in-memory cache as well,
> assuming the metadata memory footprint per file is low. Since parquet
> files are static, we would not need to invalidate the cache often.
>
> H+
>
> On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:
>
> > I am not familiar with the contents of the metadata stored, but if the
> > deserialization workload fits any of AfterBurner's claimed improvement
> > points [1], it could well be worth trying, given that the claimed gain
> > in throughput is substantial.
> >
> > It could also be a good idea to partition the cache over a number of
> > files for better parallelization, given that the number of cache files
> > generated is *significantly* less than the number of parquet files.
> > Maintaining global statistics seems an improvement point too.
> >
> > -H+
> >
> > 1:
> > https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> >
> > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:
> >
> >> Forgot to include the link for Jackson's AfterBurner module:
> >> https://github.com/FasterXML/jackson-module-afterburner
> >>
> >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:
> >>
> >> > I was going to file an enhancement JIRA but thought I would discuss
> >> > here first:
> >> >
> >> > The parquet metadata cache file is a JSON file that contains a subset
> >> > of the metadata extracted from the parquet files. The cache file can
> >> > get really large: a few GBs for a few hundred thousand files.
> >> > I have filed a separate JIRA, DRILL-3973, for profiling the various
> >> > aspects of planning, including metadata operations. In the meantime,
> >> > the timestamps in the drillbit.log output indicate a large chunk of
> >> > time spent in creating the drill table to begin with, which points to
> >> > a bottleneck in reading the metadata. (I can provide performance
> >> > numbers later once we confirm through profiling.)
> >> >
> >> > A few thoughts around improvements:
> >> > - The Jackson deserialization of the JSON file is very slow. Can this
> >> > be sped up? For instance, the AfterBurner module of Jackson claims to
> >> > improve performance by 30-40% by avoiding the use of reflection.
> >> > - The cache file read is a single-threaded process. If we were
> >> > directly reading from parquet files, we would use a default of 16
> >> > threads. What can be done to parallelize the read?
> >> > - Any operation that can be done one time during the REFRESH METADATA
> >> > command?
> >> > For instance, examining the min/max values to determine whether a
> >> > partition column is single-valued could be eliminated if we did this
> >> > computation during the REFRESH METADATA command and stored the
> >> > summary one time.
> >> >
> >> > - A pertinent question: should the cache file be stored in a more
> >> > efficient format such as Parquet instead of JSON?
> >> >
> >> > Aman
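To make a few of the suggestions above concrete, some sketches (all hedged; none of this is Drill's actual code). On the AfterBurner point, enabling it is a one-line change to the ObjectMapper setup; the dependency is jackson-module-afterburner, and the CacheFileReader wrapper here is hypothetical:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

import java.io.File;
import java.io.IOException;

public class CacheFileReader {
  // Afterburner replaces reflection-based (de)serialization with generated
  // bytecode; registering the module is the only change needed.
  private static final ObjectMapper MAPPER =
      new ObjectMapper().registerModule(new AfterburnerModule());

  static <T> T readCache(File cacheFile, Class<T> type) throws IOException {
    return MAPPER.readValue(cacheFile, type);
  }
}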
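On parallelizing the cache read: if REFRESH METADATA wrote the cache as several smaller shards, as Hanifi suggests, the read side could fan out the same way direct parquet footer reads do. A sketch under that assumption (FileMetadata and the shard layout are hypothetical):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedCacheReader {
  static class FileMetadata { public String path; /* row groups, stats, ... */ }

  private static final ObjectMapper MAPPER = new ObjectMapper();

  static List<FileMetadata> readAll(List<Path> shards) throws Exception {
    // Mirror the 16-way parallelism used for direct parquet footer reads.
    int threads = Math.max(1, Math.min(16, shards.size()));
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<List<FileMetadata>>> pending = new ArrayList<>();
      for (Path shard : shards) {
        pending.add(pool.submit(() ->
            MAPPER.<List<FileMetadata>>readValue(shard.toFile(),
                new TypeReference<List<FileMetadata>>() {})));
      }
      List<FileMetadata> merged = new ArrayList<>();
      for (Future<List<FileMetadata>> f : pending) {
        merged.addAll(f.get());   // surfaces any shard read failure
      }
      return merged;
    } finally {
      pool.shutdown();
    }
  }
}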
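On doing the single-value computation one time: a sketch of what a REFRESH METADATA-time summary might look like (ColumnRange is a stand-in for the real min/max statistics):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class PartitionSummaryBuilder {
  static class ColumnRange {
    final Object min, max;
    ColumnRange(Object min, Object max) { this.min = min; this.max = max; }
  }

  // Columns whose min == max in every row group, with the same value across
  // row groups -- i.e. single-valued, so planning can treat them as
  // partition columns without re-examining any statistics.
  static Set<String> singleValuedColumns(List<Map<String, ColumnRange>> rowGroups) {
    Map<String, Object> value = new HashMap<>();  // candidate constant per column
    Set<String> singled = new HashSet<>();
    boolean first = true;
    for (Map<String, ColumnRange> rg : rowGroups) {
      if (first) {
        for (Map.Entry<String, ColumnRange> e : rg.entrySet()) {
          if (Objects.equals(e.getValue().min, e.getValue().max)) {
            value.put(e.getKey(), e.getValue().min);
            singled.add(e.getKey());
          }
        }
        first = false;
      } else {
        singled.removeIf(col -> {
          ColumnRange r = rg.get(col);
          return r == null
              || !Objects.equals(r.min, r.max)
              || !Objects.equals(r.min, value.get(col));
        });
      }
    }
    return singled;  // persisted once, alongside the rest of the cache
  }
}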
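And for completeness, Hanifi's in-memory cache idea, subject to my memory-management concern above: since parquet files are immutable, invalidation mostly reduces to bounding memory, which a size-capped Guava cache handles. A sketch (TableMetadata and the deserialize stub are placeholders):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.nio.file.Path;

public class MetadataCache {
  static class TableMetadata { /* deserialized cache-file contents */ }

  private final LoadingCache<Path, TableMetadata> cache =
      CacheBuilder.newBuilder()
          .maximumSize(1000)    // bound memory; eviction decides "when to drop"
          .build(new CacheLoader<Path, TableMetadata>() {
            @Override
            public TableMetadata load(Path cacheFile) {
              return deserialize(cacheFile);
            }
          });

  TableMetadata get(Path cacheFile) throws Exception {
    return cache.get(cacheFile);  // hit: in-memory; miss: deserialize once
  }

  private static TableMetadata deserialize(Path p) {
    return new TableMetadata();   // stand-in for the Jackson read above
  }
}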
