Agree with Steven here. Pruning could be added to the query profile for post-execution verification.
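For concreteness, a rough sketch of what such a post-execution check could look like from a functional test, using EXPLAIN PLAN output as a stand-in until pruning details land in the query profile. The JDBC URL and the "numFiles" attribute here are assumptions based on current planner output, not a committed API; a profile-based check would inspect the executed fragments instead.

// Sketch of a functional test that verifies pruning after the fact by
// inspecting the plan text. The connection URL and the "numFiles" marker
// are assumptions, not a committed API.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PruningCheck {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "EXPLAIN PLAN FOR SELECT count(*) FROM dfs.`/data/table` WHERE `date` = '2015-07-01'")) {
      StringBuilder plan = new StringBuilder();
      while (rs.next()) {
        plan.append(rs.getString(1)).append('\n');
      }
      // If pruning happened, the scan should reference far fewer files
      // than the full table.
      boolean pruned = plan.toString().contains("numFiles=1");
      System.out.println("partition pruning " + (pruned ? "verified" : "NOT verified"));
    }
  }
}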
On Oct 29, 2015 11:33 AM, "Steven Phillips" <[email protected]> wrote:

I agree that this would present a small challenge for testing, but I don't think ease of testing should be the primary motivator in designing the software. Once we've decided what we want the software to do, we can work together to figure out how to test it.

On Thu, Oct 29, 2015 at 11:09 AM, rahul challapalli <[email protected]> wrote:

@steven If we end up pushing the partition pruning to the execution phase, how would we know that partition pruning even took place? I am thinking from the standpoint of adding functional tests around partition pruning.

- Rahul

On Wed, Oct 28, 2015 at 10:53 AM, Parth Chandra <[email protected]> wrote:

And ideally, I suppose, the merged schema would correspond to the information that we want to keep in a .drill file.

On Tue, Oct 27, 2015 at 4:55 PM, Aman Sinha <[email protected]> wrote:

@Steven, w.r.t. your suggestion about doing the metadata operation during the execution phase, see the related discussion in DRILL-3838.

A couple more thoughts:

- Parth and I were discussing keeping track of the merged schema as part of the refresh metadata and storing the merged schema once for all files that have an identical schema (currently this is repeated per file and is a huge contributor to the size of the cache file). To Jacques' point about keeping the minimum information needed for planning purposes, we could certainly do a better job of keeping it lean. The row count of the table could be computed at the time the REFRESH METADATA command runs. Similarly, the single-value analysis could be done at that time instead of on a per-query basis.

- We should revisit DRILL-2517 (https://issues.apache.org/jira/browse/DRILL-2517). Consider the following two queries and their total elapsed times against a table with 310,000 files:

(A) SELECT count(*) FROM table WHERE `date` = '2015-07-01';
    elapsed time: 980 secs

(B) SELECT count(*) FROM `table/20150701`;
    elapsed time: 54 secs

From the user's perspective, both queries should perform nearly the same, which was essentially the intent of DRILL-2517.

On Tue, Oct 27, 2015 at 12:04 PM, Steven Phillips <[email protected]> wrote:

I think we need to come up with a way to push partition pruning to execution time. The other solutions may relieve the problem in some cases, but they won't solve the fundamental problem.

For example, even if we figure out how to use multiple threads for reading the metadata, that may be fine for a couple hundred thousand files, but what about when we have millions or tens of millions of files? It will still be a huge bottleneck.

I actually think we should use the Drill execution engine to probe the metadata and generate the work assignments. We could have an additional fragment or fragments of the query that would recursively probe the filesystem, read the metadata, and make assignments, and then pipe the results into the Scanners, which would create readers on the fly. This way the query could actually begin doing work before the metadata has even been fully read.
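As a purely illustrative sketch of the producer/consumer shape being described here: one task walks the filesystem and hands off files as it finds them, while a downstream consumer starts creating readers before the walk finishes. All class names and paths are invented for the sketch; none of this is existing Drill code.

// Illustrative shape only: a metadata-probing task feeding scanners
// incrementally, so scan work overlaps the filesystem walk.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class StreamingProbe {
  private static final Path DONE = Paths.get("__done__");  // poison pill

  public static void main(String[] args) throws Exception {
    BlockingQueue<Path> work = new LinkedBlockingQueue<>();
    ExecutorService pool = Executors.newFixedThreadPool(2);

    // "Probe" fragment: recursively walk the table directory, handing off
    // each parquet file as soon as it is discovered.
    pool.submit(() -> {
      try (Stream<Path> files = Files.walk(Paths.get("/data/table"))) {
        files.filter(p -> p.toString().endsWith(".parquet")).forEach(work::add);
      } finally {
        work.add(DONE);
      }
      return null;
    });

    // "Scan" fragment: creates readers on the fly, so useful work can start
    // before the metadata has been fully read.
    pool.submit(() -> {
      for (Path p = work.take(); p != DONE; p = work.take()) {
        System.out.println("opening reader for " + p);  // stand-in for reader creation
      }
      return null;
    });

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }
}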
On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <[email protected]> wrote:

My first thought is that we've gotten too generous in what we're storing in the Parquet metadata file. Early implementations were very lean and it seems far larger today. For example, early implementations didn't keep statistics and ignored row groups (files, schema, and block locations only). If we need multiple levels of information, we may want to stagger (or normalize) them in the file. Also, we may want to think about what is the minimum that must be done in planning. We could do the file pruning at execution time rather than single-tracking these things (though that makes stats harder).

I also think we should be cautious about jumping to a conclusion until DRILL-3973 provides more insight.

In terms of caching, I'd be more inclined to rely on file system caching and make sure serialization/deserialization is as efficient as possible, as opposed to implementing an application-level cache. (We already have enough problems managing memory without having to figure out when we should drop a metadata cache :D)

Aside, I always liked this post for entertainment and the thoughts on virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <[email protected]> wrote:

One more thing: for workloads running queries over subsets of the same parquet files, we could consider maintaining an in-memory cache as well, assuming the metadata memory footprint per file is low and the parquet files are static, so we would not need to invalidate the cache often.

H+

On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:

I am not familiar with the contents of the stored metadata, but if the deserialization workload fits any of afterburner's claimed improvement points [1], it could well be worth trying, given that the claimed gain in throughput is substantial.

It could also be a good idea to partition the cache over a number of files for better parallelization, given that the number of cache files generated is *significantly* less than the number of parquet files. Maintaining global statistics seems an improvement point too.

-H+

1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
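For what it's worth, wiring in Afterburner is a one-line change on the ObjectMapper, so it is cheap to try; whether it pays off for this particular deserialization shape is exactly what the DRILL-3973 profiling should tell us. A minimal sketch (ParquetTableMetadata in the comment is a stand-in name for whatever type the cache file deserializes into):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class MapperFactory {
  // Afterburner swaps Jackson's reflection-based (de)serializers for
  // generated bytecode; everything else about reading the cache file
  // stays the same.
  public static ObjectMapper newMapper() {
    ObjectMapper mapper = new ObjectMapper();
    mapper.registerModule(new AfterburnerModule());
    return mapper;
  }
  // usage (hypothetical type name):
  // ParquetTableMetadata metadata =
  //     newMapper().readValue(cacheFile, ParquetTableMetadata.class);
}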
On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:

Forgot to include the link for Jackson's AfterBurner module: https://github.com/FasterXML/jackson-module-afterburner

On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:

I was going to file an enhancement JIRA but thought I would discuss here first:

The parquet metadata cache file is a JSON file that contains a subset of the metadata extracted from the parquet files. The cache file can get really large: a few GBs for a few hundred thousand files. I have filed a separate JIRA, DRILL-3973, for profiling the various aspects of planning, including metadata operations. In the meantime, the timestamps in the drillbit.log output indicate a large chunk of time is spent in creating the drill table to begin with, which points to a bottleneck in reading the metadata. (I can provide performance numbers later once we confirm through profiling.)

A few thoughts around improvements:

- The jackson deserialization of the JSON file is very slow. Can this be sped up? For instance, the AfterBurner module of jackson claims to improve performance by 30-40% by avoiding the use of reflection.

- The cache file read is a single-threaded process. If we were directly reading from parquet files, we would use a default of 16 threads. What can be done to parallelize the read?

- Is there any operation that can be done one time during the REFRESH METADATA command? For instance, examining the min/max values to determine single-value for a partition column could be eliminated if we do this computation during the REFRESH METADATA command and store the summary one time.

- A pertinent question: should the cache file be stored in a more efficient format such as Parquet instead of JSON?

Aman
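On the parallelization point: if the cache were split into per-directory files, as suggested upthread, the read could be fanned out the same way the 16-thread direct footer read works today. A minimal sketch under that assumption; DirectoryMetadata and readCacheFile are hypothetical stand-ins, not existing Drill types:

// Sketch: fan the deserialization of per-directory cache files across a
// fixed-size pool, then merge the results.
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCacheRead {
  private static final int THREADS = 16;  // mirrors the default for direct parquet reads

  // Stubs so the sketch is self-contained; the real types would come from
  // the parquet metadata code.
  static class DirectoryMetadata {}
  static DirectoryMetadata readCacheFile(Path p) {
    return new DirectoryMetadata();  // deserialize one per-directory cache file here
  }

  static List<DirectoryMetadata> readAll(List<Path> cacheFiles) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    try {
      List<Future<DirectoryMetadata>> pending = new ArrayList<>();
      for (Path p : cacheFiles) {
        pending.add(pool.submit(() -> readCacheFile(p)));  // one task per cache file
      }
      List<DirectoryMetadata> all = new ArrayList<>();
      for (Future<DirectoryMetadata> f : pending) {
        all.add(f.get());  // merge in submission order
      }
      return all;
    } finally {
      pool.shutdown();
    }
  }
}

How much this buys depends on how evenly the metadata is spread across directories, which is another argument for Hanifi's point about partitioning the cache deliberately rather than per directory.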
