@Steven, w.r.t. your suggestion about doing the metadata operation during the execution phase, see the related discussion in DRILL-3838.
A couple more thoughts:

- Parth and I were discussing keeping track of the merged schema as part of refresh metadata and storing the merged schema once for all files that have an identical schema (currently the schema is repeated per file and is a huge contributor to the size of the cache file). To Jacques' point about keeping the minimum information needed for planning purposes, we certainly could do a better job of keeping it lean. The row count of the table could be computed at the time of running the REFRESH METADATA command. Similarly, the single-value analysis could be done at that time instead of on a per-query basis. (I have put a rough sketch of this at the bottom of this mail.)

- We should revisit DRILL-2517 (https://issues.apache.org/jira/browse/DRILL-2517). Consider the following 2 queries and their total elapsed times against a table with 310,000 files:

(A) SELECT count(*) FROM table WHERE `date` = '2015-07-01';  elapsed time: 980 secs
(B) SELECT count(*) FROM `table/20150701`;  elapsed time: 54 secs

From the user's perspective, both queries should perform nearly the same, which was essentially the intent of DRILL-2517.

On Tue, Oct 27, 2015 at 12:04 PM, Steven Phillips <[email protected]> wrote:

> I think we need to come up with a way to push partition pruning to execution time. The other solutions may relieve the problem in some cases, but won't solve the fundamental problem.
>
> For example, even if we do figure out how to use multiple threads for reading the metadata, that may be fine for a couple hundred thousand files, but what about when we have millions or tens of millions of files? It will still be a huge bottleneck.
>
> I actually think we should use the Drill execution engine to probe the metadata and generate the work assignments. We could have an additional fragment or fragments of the query that would recursively probe the filesystem, read the metadata, and make assignments, and then pipe the results into the Scanners, which would create readers on the fly. This way the query could actually begin doing work before the metadata has even been fully read.
>
> On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <[email protected]> wrote:
>
> > My first thought is we've gotten too generous in what we're storing in the Parquet metadata file. Early implementations were very lean and it seems far larger today. For example, early implementations didn't keep statistics and ignored row groups (files, schema and block locations only). If we need multiple levels of information, we may want to stagger (or normalize) them in the file. Also, we may think about what is the minimum that must be done in planning. We could do the file pruning at execution time rather than single-tracking these things (makes stats harder though).
> >
> > I also think we should be cautious around jumping to a conclusion until DRILL-3973 provides more insight.
> >
> > In terms of caching, I'd be more inclined to rely on file system caching and make sure serialization/deserialization is as efficient as possible, as opposed to implementing an application-level cache. (We already have enough problems managing memory without having to figure out when we should drop a metadata cache :D)
> >
> > Aside, I always liked this post for entertainment and the thoughts on virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <[email protected]> wrote:
> >
> > > One more thing: for workloads running queries over subsets of the same parquet files, we can consider maintaining an in-memory cache as well, assuming the metadata memory footprint per file is low and the parquet files are static, so we would not need to invalidate the cache often.
> > >
> > > H+
> > >
> > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:
> > >
> > > > I am not familiar with the contents of the metadata stored, but if the deserialization workload fits any of AfterBurner's claimed improvement points [1], it could well be worth trying, given the claimed gain in throughput is substantial.
> > > >
> > > > It could also be a good idea to partition caching over a number of files for better parallelization, given the number of cache files generated is *significantly* less than the number of parquet files. Maintaining global statistics seems an improvement point too.
> > > >
> > > > -H+
> > > >
> > > > 1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > >
> > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:
> > > >
> > > >> Forgot to include the link for Jackson's AfterBurner module: https://github.com/FasterXML/jackson-module-afterburner
> > > >>
> > > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:
> > > >>
> > > >> > I was going to file an enhancement JIRA but thought I would discuss here first:
> > > >> >
> > > >> > The parquet metadata cache file is a JSON file that contains a subset of the metadata extracted from the parquet files. The cache file can get really large: a few GBs for a few hundred thousand files. I have filed a separate JIRA, DRILL-3973, for profiling the various aspects of planning, including metadata operations. In the meantime, the timestamps in the drillbit.log output indicate a large chunk of time spent in creating the drill table to begin with, which points to a bottleneck in reading the metadata. (I can provide performance numbers later once we confirm through profiling.)
> > > >> >
> > > >> > A few thoughts around improvements:
> > > >> > - The jackson deserialization of the JSON file is very slow. Can this be sped up? For instance, the AfterBurner module of jackson claims to improve performance by 30-40% by avoiding the use of reflection.
> > > >> > - The cache file read is a single-threaded process. If we were directly reading from parquet files, we would use a default of 16 threads. What can be done to parallelize the read?
> > > >> > - Any operation that can be done one time during the REFRESH METADATA command? For instance, examining the min/max values to determine single-value for a partition column could be eliminated if we do this computation during the REFRESH METADATA command and store the summary one time.
> > > >> > - A pertinent question: should the cache file be stored in a more efficient format such as Parquet instead of JSON?
> > > >> >
> > > >> > Aman
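
Following up on the merged-schema point above, here is a rough, untested sketch of what a normalized cache layout could look like: each distinct schema is stored once, files reference it by id, and the table row count is accumulated in the same pass at REFRESH METADATA time. All class and field names below are hypothetical, not the actual Metadata classes:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NormalizedMetadataCache {

  // Distinct schemas, stored once; the index in this list is the schema id.
  private final List<String> schemas = new ArrayList<>();

  // Serialized schema -> id, used while building the cache during REFRESH METADATA.
  private final Map<String, Integer> schemaIds = new HashMap<>();

  // Per-file entries carry only a schema id plus lean per-file stats,
  // instead of repeating the full schema for every file.
  public static class FileEntry {
    final String path;
    final int schemaId;
    final long rowCount;

    FileEntry(String path, int schemaId, long rowCount) {
      this.path = path;
      this.schemaId = schemaId;
      this.rowCount = rowCount;
    }
  }

  private final List<FileEntry> files = new ArrayList<>();

  // Table-level row count, computed once here rather than per query.
  private long totalRowCount;

  public void addFile(String path, String serializedSchema, long rowCount) {
    Integer id = schemaIds.get(serializedSchema);
    if (id == null) {
      id = schemas.size();
      schemas.add(serializedSchema);
      schemaIds.put(serializedSchema, id);
    }
    files.add(new FileEntry(path, id, rowCount));
    totalRowCount += rowCount;
  }
}

With a few hundred thousand files but only a handful of distinct schemas, the schema section of the cache file shrinks from O(files) to O(distinct schemas).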

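And re the AfterBurner and parallel-read points in the earlier mails: registering the module is a one-line change once jackson-module-afterburner is added as a dependency, and if the cache were partitioned into multiple files (as Hanifi suggested) the deserialization could be farmed out to a thread pool. A rough sketch; CacheShard is a hypothetical stand-in for the real cache POJOs:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CacheReaderSketch {

  // AfterBurner replaces Jackson's reflection-based property access
  // with generated bytecode; this is where the claimed 30-40% gain
  // in (de)serialization speed comes from.
  private static final ObjectMapper MAPPER =
      new ObjectMapper().registerModule(new AfterburnerModule());

  // Hypothetical POJO for one partition (shard) of the cache file.
  public static class CacheShard { }

  // Deserialize the cache shards in parallel, one task per file,
  // mirroring the default of 16 threads used for direct parquet reads.
  public static List<CacheShard> readShards(List<File> shardFiles)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(16);
    try {
      List<Future<CacheShard>> futures = new ArrayList<>();
      for (File f : shardFiles) {
        futures.add(pool.submit(() -> MAPPER.readValue(f, CacheShard.class)));
      }
      List<CacheShard> shards = new ArrayList<>();
      for (Future<CacheShard> future : futures) {
        shards.add(future.get());
      }
      return shards;
    } finally {
      pool.shutdown();
    }
  }
}

Whether any of this pays off for a single multi-GB JSON file is exactly the kind of thing the DRILL-3973 profiling should tell us.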