I definitely think we should subdivide into a couple of different spaces:

- XX files where we only need to query N of them (via partition pruning)
- XX files where we need to query all or most of the files

where XX is [10, 10M, 10MM] and N is [100%, 0.0001%]. Is that a
reasonable matrix? I just don't want to make everything more complicated
for the 10MM/0.0001% case at the cost of the other cases, as I'm not
sure how common that case is. I also think that Steven's parallelization
comments are more applicable to some of these cases than others.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Nov 4, 2015 at 3:59 PM, Parth Chandra <[email protected]> wrote:

> It's been on my list of things to try. I'll give it a shot. At the
> moment, I've got the file size reduced to 40% of the previous size.
> With some other changes this might give us good enough performance,
> but it would be far from ideal.
>
> Any other thoughts are welcome. I'll run some experiments and then
> write up a proposal.
>
> On Wed, Nov 4, 2015 at 9:15 AM, Aman Sinha <[email protected]> wrote:
>
> > It would be good to try; however, I recall that we encountered a
> > SchemaChangeException when querying the JSON cache file. Parth might
> > have more success once he has simplified the metadata.
> >
> > Aman
> >
> > On Wed, Nov 4, 2015 at 8:31 AM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> > > I've been thinking more about this and I think Aman's suggestion
> > > of Parquet files is worth a POC.
> > >
> > > What we could do:
> > >
> > > - Run a select * order by partCol1, partCol2, ... , partColN query
> > >   against the existing large JSON partition file and create a new
> > >   Parquet version of the file.
> > > - Hand-write a partition-type read against the Parquet APIs using
> > >   the filter APIs and see what performance looks like.
> > >
> > > Thoughts?
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
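A minimal sketch of the hand-written, filter-pushdown read described
above, using parquet-mr's example Group API (package names assume a
parquet-mr 1.8-era build; the file path, column name, and value are
illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.filter2.compat.FilterCompat;
    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;
    import org.apache.parquet.io.api.Binary;

    public class MetadataCacheFilterPoc {
      public static void main(String[] args) throws Exception {
        // Predicate on the partition column; row groups whose min/max
        // statistics exclude this value are skipped entirely.
        FilterPredicate pred = FilterApi.eq(
            FilterApi.binaryColumn("partCol1"),
            Binary.fromString("somePartitionValue"));

        ParquetReader<Group> reader = ParquetReader
            .builder(new GroupReadSupport(),
                     new Path("/tmp/metadata_cache.parquet"))
            .withFilter(FilterCompat.get(pred))
            .build();

        Group g;
        while ((g = reader.read()) != null) {
          // collect the file/row-group entries that survive the filter
        }
        reader.close();
      }
    }

Because the upstream sort clusters each partition value into a handful
of row groups, most of the file would never be deserialized in the
10MM-files / 0.0001% case.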
> > > On Fri, Oct 30, 2015 at 3:36 PM, Parth Chandra <[email protected]>
> > > wrote:
> > >
> > > > Thanks Steven for the link.
> > > > Your suggestion of storing only the single-valued columns is a
> > > > good one. It might be OK to have some of the count(*) queries
> > > > run a little slower, as reading the cache itself is taking way
> > > > too long. I'm also looking at squashing the column datatype info
> > > > as there is a lot of redundancy there.
> > > >
> > > > On Fri, Oct 30, 2015 at 3:22 PM, Steven Phillips
> > > > <[email protected]> wrote:
> > > >
> > > > > My view on storing it in some other format is that, yes, it
> > > > > will probably reduce the size of the file, but if we gzip the
> > > > > JSON file, it should be pretty compact. As for deserialization
> > > > > cost, other formats would be faster, but not dramatically
> > > > > faster. Certainly not the order of magnitude faster that we
> > > > > really need it to be. The reason we chose JSON was because it
> > > > > is readable and easier to deal with.
> > > > >
> > > > > As for the old code, I can point you at a branch, but it's
> > > > > probably not very helpful. Unless we want to essentially
> > > > > disable value-based partition pruning when using the cache,
> > > > > the old code will not work.
> > > > >
> > > > > My recommendation would be to come up with a new version of
> > > > > the format which stores only the name and value of columns
> > > > > which are single-valued for each file or row group. This will
> > > > > allow partition pruning to work, but some count queries may
> > > > > not be as fast any more, because the cache won't have column
> > > > > value counts on a per-rowgroup basis any more.
> > > > >
> > > > > Anyway, here is the link to the original branch.
> > > > >
> > > > > https://github.com/StevenMPhillips/drill/tree/meta
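A hypothetical shape for the lean format Steven describes, one entry per
file (or row group), with Jackson bindings (class and field names are
made up for illustration):

    import java.util.Map;
    import com.fasterxml.jackson.annotation.JsonProperty;

    // One entry per file (or row group): keep only the columns that
    // hold a single value, so value-based partition pruning still
    // works, while per-row-group column value counts are dropped.
    public class LeanFileMetadata {
      @JsonProperty("path")
      public String path;

      @JsonProperty("rowCount")
      public long rowCount;

      // column name -> its single value across the file/row group
      @JsonProperty("singleValueColumns")
      public Map<String, Object> singleValueColumns;
    }

As Steven notes, count-style queries lose their per-rowgroup counts
under a shape like this; a per-file rowCount, as above, is about the
cheapest count information that could be kept.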
> > > > > On Fri, Oct 30, 2015 at 3:01 PM, Parth Chandra
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hey Jacques, Steven,
> > > > > >
> > > > > > Do we have a branch somewhere which has the initial
> > > > > > prototype code? I'd like to prune the file a bit, as it
> > > > > > looks like reducing the size of the metadata cache file
> > > > > > might yield the best results.
> > > > > >
> > > > > > Also, did we have a particular reason for going with JSON as
> > > > > > opposed to a more compact binary format? Are there any
> > > > > > arguments against saving this as a protobuf/BSON/Parquet
> > > > > > file?
> > > > > >
> > > > > > Parth
> > > > > >
> > > > > > On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > My first thought is we've gotten too generous in what
> > > > > > > we're storing in the Parquet metadata file. Early
> > > > > > > implementations were very lean and it seems far larger
> > > > > > > today. For example, early implementations didn't keep
> > > > > > > statistics and ignored row groups (files, schema and block
> > > > > > > locations only). If we need multiple levels of
> > > > > > > information, we may want to stagger (or normalize) them in
> > > > > > > the file. Also, we may think about what is the minimum
> > > > > > > that must be done in planning. We could do the file
> > > > > > > pruning at execution time rather than single-tracking
> > > > > > > these things (makes stats harder though).
> > > > > > >
> > > > > > > I also think we should be cautious around jumping to a
> > > > > > > conclusion until DRILL-3973 provides more insight.
> > > > > > >
> > > > > > > In terms of caching, I'd be more inclined to rely on file
> > > > > > > system caching and make sure
> > > > > > > serialization/deserialization is as efficient as possible,
> > > > > > > as opposed to implementing an application-level cache. (We
> > > > > > > already have enough problems managing memory without
> > > > > > > having to figure out when we should drop a metadata cache
> > > > > > > :D).
> > > > > > >
> > > > > > > Aside, I always liked this post for entertainment and the
> > > > > > > thoughts on virtual memory:
> > > > > > > https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> > > > > > >
> > > > > > > --
> > > > > > > Jacques Nadeau
> > > > > > > CTO and Co-Founder, Dremio
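One possible reading of the "stagger (or normalize)" idea, sketched as a
lean summary section that planning always reads plus a detail section
that is loaded lazily (all class and field names here are hypothetical):

    import java.util.List;
    import java.util.Map;

    // Lean summary: files, schema and block locations only, mirroring
    // what the early implementations kept. Planning reads just this.
    public class CacheSummary {
      public String tableSchema;        // serialized table schema
      public List<String> filePaths;    // file paths + block locations
    }

    // Heavier per-row-group statistics, normalized into a separate
    // section (or file) and only deserialized when actually needed,
    // e.g. if file pruning is deferred to execution time.
    public class CacheDetail {
      public Map<String, List<RowGroupStats>> statsByFile;
    }

    public class RowGroupStats {
      public long start;
      public long length;
      public long rowCount;
      public Map<String, Object> minValues;
      public Map<String, Object> maxValues;
    }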
> > > > > > > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > One more thing: for workloads running queries over
> > > > > > > > subsets of the same parquet files, we can consider
> > > > > > > > maintaining an in-memory cache as well, assuming the
> > > > > > > > metadata memory footprint per file is low and parquet
> > > > > > > > files are static, so we would not need to invalidate the
> > > > > > > > cache often.
> > > > > > > >
> > > > > > > > H+
> > > > > > > >
> > > > > > > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes
> > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > I am not familiar with the contents of the metadata
> > > > > > > > > stored, but if the deserialization workload fits any
> > > > > > > > > of Afterburner's claimed improvement points [1], it
> > > > > > > > > could well be worth trying, given that the claimed
> > > > > > > > > gain in throughput is substantial.
> > > > > > > > >
> > > > > > > > > It could also be a good idea to partition caching over
> > > > > > > > > a number of files for better parallelization, given
> > > > > > > > > that the number of cache files generated is
> > > > > > > > > *significantly* smaller than the number of parquet
> > > > > > > > > files. Maintaining global statistics seems an
> > > > > > > > > improvement point too.
> > > > > > > > >
> > > > > > > > > -H+
> > > > > > > > >
> > > > > > > > > 1:
> > > > > > > > > https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > > > > > > >
> > > > > > > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Forgot to include the link for Jackson's AfterBurner
> > > > > > > > > > module:
> > > > > > > > > > https://github.com/FasterXML/jackson-module-afterburner
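Wiring Afterburner in is a one-line change to whatever ObjectMapper
reads the cache file; a sketch (the factory class here is hypothetical,
the module API is Jackson's):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

    public class CacheMapperFactory {
      // Afterburner swaps Jackson's reflection-based (de)serializers
      // for generated bytecode, which is where the claimed 30-40%
      // speedup comes from.
      public static ObjectMapper newMapper() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(new AfterburnerModule());
        return mapper;
      }
    }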
> > > > > > > > > > On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha
> > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > I was going to file an enhancement JIRA but
> > > > > > > > > > > thought I would discuss here first:
> > > > > > > > > > >
> > > > > > > > > > > The parquet metadata cache file is a JSON file
> > > > > > > > > > > that contains a subset of the metadata extracted
> > > > > > > > > > > from the parquet files. The cache file can get
> > > > > > > > > > > really large: a few GBs for a few hundred thousand
> > > > > > > > > > > files. I have filed a separate JIRA, DRILL-3973,
> > > > > > > > > > > for profiling the various aspects of planning,
> > > > > > > > > > > including metadata operations. In the meantime,
> > > > > > > > > > > the timestamps in the drillbit.log output indicate
> > > > > > > > > > > a large chunk of time spent in creating the drill
> > > > > > > > > > > table to begin with, which points to a bottleneck
> > > > > > > > > > > in reading the metadata. (I can provide
> > > > > > > > > > > performance numbers later, once we confirm through
> > > > > > > > > > > profiling.)
> > > > > > > > > > >
> > > > > > > > > > > A few thoughts around improvements:
> > > > > > > > > > >
> > > > > > > > > > > - The Jackson deserialization of the JSON file is
> > > > > > > > > > >   very slow. Can this be sped up? For instance,
> > > > > > > > > > >   the AfterBurner module of Jackson claims to
> > > > > > > > > > >   improve performance by 30-40% by avoiding the
> > > > > > > > > > >   use of reflection.
> > > > > > > > > > > - The cache file read is a single-threaded
> > > > > > > > > > >   process. If we were directly reading from
> > > > > > > > > > >   parquet files, we would use a default of 16
> > > > > > > > > > >   threads. What can be done to parallelize the
> > > > > > > > > > >   read?
> > > > > > > > > > > - Is there any operation that can be done one time
> > > > > > > > > > >   during the REFRESH METADATA command? For
> > > > > > > > > > >   instance, examining the min/max values to
> > > > > > > > > > >   determine single-valuedness for a partition
> > > > > > > > > > >   column could be eliminated if we do this
> > > > > > > > > > >   computation during the REFRESH METADATA command
> > > > > > > > > > >   and store the summary one time.
> > > > > > > > > > > - A pertinent question is: should the cache file
> > > > > > > > > > >   be stored in a more efficient format such as
> > > > > > > > > > >   Parquet instead of JSON?
> > > > > > > > > > >
> > > > > > > > > > > Aman
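A sketch of the one-time min/max check Aman describes, run per row group
at REFRESH METADATA time against parquet-mr's footer metadata (the class
and method wrapping it are illustrative):

    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

    public class SingleValueCheck {
      // A column is single-valued within a row group when its footer
      // statistics show min == max with no nulls; only those
      // (name, value) pairs need to be written into the cache summary,
      // so planning never has to re-derive them from min/max pairs.
      static boolean isSingleValued(ColumnChunkMetaData column) {
        Statistics<?> stats = column.getStatistics();
        return stats != null
            && !stats.isEmpty()
            && stats.getNumNulls() == 0
            && stats.genericGetMin().equals(stats.genericGetMax());
      }
    }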
