One more thing: for workloads that run queries over subsets of the same parquet files, we could also consider maintaining an in-memory cache of the deserialized metadata. This assumes the per-file metadata memory footprint is low and that the parquet files are static, so the cache would rarely need to be invalidated. A rough sketch of the idea is below.
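To make that concrete, here is a minimal, untested sketch using Guava's CacheBuilder, assuming Guava is available on the classpath. ParquetTableMetadata and readCacheFile() are placeholder names for whatever object and code path we use to deserialize the JSON cache file today, not existing Drill classes:

    import java.util.concurrent.ExecutionException;
    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class MetadataCacheHolder {
      // Bounded, process-wide cache keyed by the metadata cache file path.
      private static final Cache<String, ParquetTableMetadata> CACHE =
          CacheBuilder.newBuilder()
              .maximumSize(1000)   // keep at most 1000 deserialized cache files
              .build();

      public static ParquetTableMetadata get(final String cacheFilePath)
          throws ExecutionException {
        // Deserialize the JSON cache file only on a miss; later queries over
        // the same (static) parquet files reuse the already-parsed object.
        return CACHE.get(cacheFilePath, () -> readCacheFile(cacheFilePath));
      }

      private static ParquetTableMetadata readCacheFile(String path) {
        // placeholder for the existing Jackson deserialization path
        throw new UnsupportedOperationException("sketch only");
      }
    }

    // Placeholder standing in for whatever object the JSON cache file maps to.
    class ParquetTableMetadata {}

The maximumSize bound is just an example knob; whether it is even needed depends on how large the deserialized metadata turns out to be per file.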
H+

On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:

> I am not familiar with the contents of the metadata stored, but if the
> deserialization workload fits any of afterburner's claimed improvement
> points [1], it could well be worth trying given that the claimed gain in
> throughput is substantial.
>
> It could also be a good idea to partition caching over a number of files
> for better parallelization, given that the number of cache files generated
> is *significantly* less than the number of parquet files. Maintaining
> global statistics seems an improvement point too.
>
> -H+
>
> 1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
>
> On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:
>
>> Forgot to include the link for Jackson's AfterBurner module:
>> https://github.com/FasterXML/jackson-module-afterburner
>>
>> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:
>>
>> > I was going to file an enhancement JIRA but thought I would discuss here
>> > first:
>> >
>> > The parquet metadata cache file is a JSON file that contains a subset of
>> > the metadata extracted from the parquet files. The cache file can get
>> > really large: a few GBs for a few hundred thousand files.
>> > I have filed a separate JIRA, DRILL-3973, for profiling the various
>> > aspects of planning, including metadata operations. In the meantime, the
>> > timestamps in the drillbit.log output indicate a large chunk of time is
>> > spent creating the drill table to begin with, which points to a
>> > bottleneck in reading the metadata. (I can provide performance numbers
>> > later once we confirm through profiling.)
>> >
>> > A few thoughts around improvements:
>> > - The Jackson deserialization of the JSON file is very slow. Can this be
>> > sped up? For instance, the AfterBurner module of Jackson claims to
>> > improve performance by 30-40% by avoiding the use of reflection.
>> > - The cache file read is a single-threaded process. If we were directly
>> > reading from parquet files, we would use a default of 16 threads. What
>> > can be done to parallelize the read?
>> > - Is there any operation that can be done one time during the REFRESH
>> > METADATA command? For instance, examining the min/max values to
>> > determine whether a partition column has a single value could be
>> > eliminated if we do this computation during the REFRESH METADATA command
>> > and store the summary one time.
>> >
>> > - A pertinent question is: should the cache file be stored in a more
>> > efficient format such as Parquet instead of JSON?
>> >
>> > Aman
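Regarding the Afterburner suggestion in the thread above: it should be cheap to benchmark, since enabling it is a one-line change on the ObjectMapper. A minimal sketch (ParquetTableMetadata again stands in for whatever type the cache file deserializes into):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

    ObjectMapper mapper = new ObjectMapper();
    // Afterburner replaces reflection-based (de)serializers with generated bytecode.
    mapper.registerModule(new AfterburnerModule());
    // mapper.readValue(cacheFile, ParquetTableMetadata.class) would then take
    // the optimized path when deserializing the metadata cache file.

If REFRESH METADATA also split the cache into several shard files (per directory, say), those shards could be deserialized on a small thread pool and merged, which would address the single-threaded read as well.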
