I'm collecting some data on what a reasonable matrix would be. I think you're right that there may be a set of cases where Steven's parallelization approach makes more sense and some where it might be more expensive. I'll post something with information about the use cases in the next few days.
On Wed, Nov 4, 2015 at 4:15 PM, Jacques Nadeau <[email protected]> wrote:

I definitely think we should subdivide into a couple different spaces:

- XX files where we only need to query N (via partition pruning)
- XX files where we need to query all or most of the files

where XX is [10, 10M, 10MM] and N is [100%, 0.0001%].

Is that a reasonable matrix?

I just don't want to make everything more complicated for the 10MM/0.0001% case at the cost of the other cases, as I'm not sure how common that case is.

I also think that Steven's parallelization comments are more applicable to some of these than others.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Nov 4, 2015 at 3:59 PM, Parth Chandra <[email protected]> wrote:

It's been on my list of things to try. I'll give it a shot. At the moment, I've got the file size reduced to 40% of its previous size. With some other changes this might give us good enough performance, but it would be far from ideal.

Any other thoughts are welcome. I'll run some experiments and then write up a proposal.

On Wed, Nov 4, 2015 at 9:15 AM, Aman Sinha <[email protected]> wrote:

It would be good to try; however, I recall that we encountered a SchemaChangeException when querying the JSON cache file. Parth might have more success once he has simplified the metadata.

Aman

On Wed, Nov 4, 2015 at 8:31 AM, Jacques Nadeau <[email protected]> wrote:

I've been thinking more about this and I think Aman's suggestion of Parquet files is worth a POC.

What we could do:

- Run a select * order by partCol1, partCol2, ..., partColN query against the existing large JSON partition file and create a new Parquet version of the file.
- Hand-write a partition-type read against the Parquet APIs using the filter APIs and see what performance looks like.

Thoughts?

--
Jacques Nadeau
CTO and Co-Founder, Dremio
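For anyone who wants to try that POC, a minimal sketch of the hand-written read using parquet-mr's filter2 API might look like the following. The file path, column name, and value are hypothetical; this is only a sketch of the approach, not existing Drill code:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.io.api.Binary;

import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
import static org.apache.parquet.filter2.predicate.FilterApi.eq;

public class MetadataFilterPoc {
  public static void main(String[] args) throws Exception {
    // Hypothetical: the metadata cache rewritten as a Parquet file,
    // sorted by partition columns as described above.
    Path cacheFile = new Path("/tmp/drill_metadata_cache.parquet");

    // Push a predicate on a partition column down to the reader; row groups
    // whose min/max statistics exclude the value are skipped entirely.
    FilterPredicate onePartition =
        eq(binaryColumn("partCol1"), Binary.fromString("2015-11-04"));

    try (ParquetReader<Group> reader = ParquetReader
        .builder(new GroupReadSupport(), cacheFile)
        .withFilter(FilterCompat.get(onePartition))
        .build()) {
      Group g;
      while ((g = reader.read()) != null) {
        // Each surviving record describes one file/row group to plan against.
        System.out.println(g);
      }
    }
  }
}
```

Because the file is sorted by the partition columns, matching entries are clustered into few row groups, which is what would make this read fast relative to scanning the whole JSON cache.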
On Fri, Oct 30, 2015 at 3:36 PM, Parth Chandra <[email protected]> wrote:

Thanks Steven for the link.

Your suggestion of storing only the single-valued columns is a good one. It might be OK to have some of the count(*) queries run a little slower, as reading the cache itself is taking way too long. I'm also looking at squashing the column datatype info, as there is a lot of redundancy there.

On Fri, Oct 30, 2015 at 3:22 PM, Steven Phillips <[email protected]> wrote:

My view on storing it in some other format is that, yes, it will probably reduce the size of the file, but if we gzip the JSON file, it should be pretty compact. As for deserialization cost, other formats would be faster, but not dramatically faster. Certainly not the order of magnitude faster that we really need it to be. The reason we chose JSON was because it is readable and easier to deal with.

As for the old code, I can point you at a branch, but it's probably not very helpful. Unless we want to essentially disable value-based partition pruning when using the cache, the old code will not work.

My recommendation would be to come up with a new version of the format which stores only the name and value of columns which are single-valued for each file or row group. This will allow partition pruning to work, but some count queries may not be as fast anymore, because the cache won't have column value counts on a per-row-group basis anymore.

Anyway, here is the link to the original branch:

https://github.com/StevenMPhillips/drill/tree/meta
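As a point of reference, a per-file entry in a format like the one Steven describes could be as small as the following. This is a hypothetical sketch, not the actual cache schema; all names are made up:

```java
import java.util.List;

/**
 * Sketch of a leaner per-file cache entry: instead of min/max/count
 * statistics for every column of every row group, keep only the columns
 * that are single-valued across the file, since those are the ones
 * partition pruning can actually use.
 */
public class LeanFileMetadata {
  public String path;                           // file path relative to the table root
  public long rowCount;                         // total rows, so count(*) stays cheap at file granularity
  public List<SingleValuedColumn> singleValued; // only columns with exactly one value in this file

  public static class SingleValuedColumn {
    public String name;   // column name
    public Object value;  // the one value the column takes in this file
  }
}
```

The trade-off Steven mentions falls out directly: per-row-group value counts are gone, so some count queries lose their shortcut, but pruning still has everything it needs.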
On Fri, Oct 30, 2015 at 3:01 PM, Parth Chandra <[email protected]> wrote:

Hey Jacques, Steven,

Do we have a branch somewhere which has the initial prototype code? I'd like to prune the file a bit, as it looks like reducing the size of the metadata cache file might yield the best results.

Also, did we have a particular reason for going with JSON as opposed to a more compact binary format? Are there any arguments against saving this as a protobuf/BSON/Parquet file?

Parth

On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <[email protected]> wrote:

My first thought is that we've gotten too generous in what we're storing in the Parquet metadata file. Early implementations were very lean and it seems far larger today. For example, early implementations didn't keep statistics and ignored row groups (files, schema and block locations only). If we need multiple levels of information, we may want to stagger (or normalize) them in the file. Also, we may think about what is the minimum that must be done in planning. We could do the file pruning at execution time rather than single-tracking these things (makes stats harder though).

I also think we should be cautious around jumping to a conclusion until DRILL-3973 provides more insight.

In terms of caching, I'd be more inclined to rely on file system caching and make sure serialization/deserialization is as efficient as possible, as opposed to implementing an application-level cache. (We already have enough problems managing memory without having to figure out when we should drop a metadata cache :D)

Aside, I always liked this post for entertainment and the thoughts on virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <[email protected]> wrote:

One more thing: for workloads running queries over subsets of the same parquet files, we can consider maintaining an in-memory cache as well, assuming the metadata memory footprint per file is low and the parquet files are static, so we would not need to invalidate the cache often.

H+

On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:

I am not familiar with the contents of the metadata stored, but if the deserialization workload fits any of Afterburner's claimed improvement points [1], it could well be worth trying, given that the claimed gain in throughput is substantial.

It could also be a good idea to partition the caching over a number of files for better parallelization, given that the number of cache files generated is *significantly* less than the number of parquet files. Maintaining global statistics seems an improvement point too.

-H+

1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
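For what it's worth, trying Afterburner is a one-line change on the mapper, so it should be cheap to measure. A sketch, deserializing into the hypothetical LeanFileMetadata type from above rather than Drill's actual cache classes:

```java
import java.io.File;
import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class CacheReader {
  public static LeanFileMetadata[] read(File cacheFile) throws IOException {
    // Afterburner swaps Jackson's reflection-based property access for
    // generated bytecode; its docs claim roughly 30-40% speedup.
    ObjectMapper mapper = new ObjectMapper()
        .registerModule(new AfterburnerModule());
    return mapper.readValue(cacheFile, LeanFileMetadata[].class);
  }
}
```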
On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:

Forgot to include the link for Jackson's Afterburner module: https://github.com/FasterXML/jackson-module-afterburner

On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:

I was going to file an enhancement JIRA but thought I would discuss here first.

The parquet metadata cache file is a JSON file that contains a subset of the metadata extracted from the parquet files. The cache file can get really large: a few GBs for a few hundred thousand files. I have filed a separate JIRA, DRILL-3973, for profiling the various aspects of planning, including metadata operations. In the meantime, the timestamps in the drillbit.log output indicate a large chunk of time spent in creating the drill table to begin with, which points to a bottleneck in reading the metadata. (I can provide performance numbers later once we confirm through profiling.)

A few thoughts around improvements:

- The Jackson deserialization of the JSON file is very slow. Can this be sped up? For instance, the Afterburner module of Jackson claims to improve performance by 30-40% by avoiding the use of reflection.
- The cache file read is a single-threaded process. If we were directly reading from parquet files, we use a default of 16 threads. What can be done to parallelize the read?
- Is there any operation that can be done one time during the REFRESH METADATA command? For instance, examining the min/max values to determine single-valuedness for a partition column could be eliminated if we do this computation during the REFRESH METADATA command and store the summary one time.
- A pertinent question: should the cache file be stored in a more efficient format such as Parquet instead of JSON?

Aman
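On the single-threaded read: if the cache were partitioned into multiple files as Hanifi suggests, the read side could fan out with a plain executor. A rough sketch under that assumption; nothing here is existing Drill code, and LeanFileMetadata is the hypothetical type from above:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.fasterxml.jackson.databind.ObjectMapper;

public class ParallelCacheReader {
  public static List<LeanFileMetadata> readAll(List<File> shards) throws Exception {
    // Match the default of 16 threads used when reading raw parquet files.
    ExecutorService pool = Executors.newFixedThreadPool(16);
    try {
      // ObjectMapper is thread-safe once configured, so one instance is shared.
      ObjectMapper mapper = new ObjectMapper();
      List<Future<LeanFileMetadata[]>> futures = new ArrayList<>();
      for (File shard : shards) {
        futures.add(pool.submit(() -> mapper.readValue(shard, LeanFileMetadata[].class)));
      }
      List<LeanFileMetadata> all = new ArrayList<>();
      for (Future<LeanFileMetadata[]> f : futures) {
        Collections.addAll(all, f.get()); // propagate any deserialization failure
      }
      return all;
    } finally {
      pool.shutdown();
    }
  }
}
```

Whether this wins depends on the matrix discussed above: with a handful of shards the fan-out is pure overhead, while at the 10MM-file end the 16-way read should roughly track the parallelism we already get when scanning the parquet files directly.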
