I am not familiar with the contents of the stored metadata, but if the
deserialization workload fits any of Afterburner's claimed improvement
points [1], it could well be worth trying, given that the claimed
throughput gain is substantial.
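For reference, wiring Afterburner in is a one-line module registration. The sketch below uses the Jackson classes from the linked README; `CachedMetadata` and `cacheFile` are hypothetical stand-ins for whatever POJO and file handle Drill's cache reader actually uses:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

// Registering the module swaps reflection-based (de)serialization for
// generated bytecode; the rest of the read path is unchanged.
ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new AfterburnerModule());

// CachedMetadata is a hypothetical stand-in for Drill's cache POJO.
CachedMetadata metadata = mapper.readValue(cacheFile, CachedMetadata.class);
```

Since the module only changes how fields are accessed, it should be a low-risk experiment to benchmark against the current reader.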
It could also be a good idea to partition the cache over a number of files
for better parallelization, provided the number of cache files generated is
*significantly* less than the number of parquet files. Maintaining global
statistics seems like an improvement point too.

-H

[1] https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized

On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:

> Forgot to include the link for Jackson's AfterBurner module:
> https://github.com/FasterXML/jackson-module-afterburner
>
> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:
>
> > I was going to file an enhancement JIRA but thought I would discuss here
> > first:
> >
> > The parquet metadata cache file is a JSON file that contains a subset of
> > the metadata extracted from the parquet files. The cache file can get
> > really large: a few GBs for a few hundred thousand files.
> > I have filed a separate JIRA, DRILL-3973, for profiling the various
> > aspects of planning, including metadata operations. In the meantime, the
> > timestamps in the drillbit.log output indicate a large chunk of time
> > spent in creating the drill table to begin with, which points to a
> > bottleneck in reading the metadata. (I can provide performance numbers
> > later once we confirm through profiling.)
> >
> > A few thoughts around improvements:
> > - The jackson deserialization of the JSON file is very slow. Can it be
> > sped up? For instance, the AfterBurner module of jackson claims to
> > improve performance by 30-40% by avoiding the use of reflection.
> > - The cache file read is a single-threaded process. If we were directly
> > reading from parquet files, we would use a default of 16 threads. What
> > can be done to parallelize the read?
> > - Is there any operation that can be done one time during the REFRESH
> > METADATA command? For instance, examining the min/max values to
> > determine single-valuedness of a partition column could be eliminated
> > if we do this computation during the REFRESH METADATA command and store
> > the summary one time.
> > - A pertinent question: should the cache file be stored in a more
> > efficient format such as Parquet instead of JSON?
> >
> > Aman
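To make the partitioned-cache / parallel-read idea concrete, here is a minimal stdlib-only sketch of fanning shard reads out over a fixed pool (mirroring the 16-thread default mentioned in the thread). The shard layout and the byte-level "parse" step are placeholder assumptions, not Drill's actual reader:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCacheRead {
    // Read N cache shards concurrently and collect the results in
    // submission order. Parsing is stubbed out as "read all bytes";
    // in Drill this would be the JSON (or Parquet) deserialization
    // of each shard.
    static List<byte[]> readShards(List<Path> shards, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (Path shard : shards) {
                futures.add(pool.submit(() -> Files.readAllBytes(shard)));
            }
            List<byte[]> results = new ArrayList<>();
            for (Future<byte[]> f : futures) {
                results.add(f.get()); // propagates any read failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo: write a few temp shards, then read them back in parallel.
        List<Path> shards = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            Path p = Files.createTempFile("cache_shard_" + i, ".json");
            Files.write(p, ("{\"shard\":" + i + "}").getBytes());
            shards.add(p);
        }
        List<byte[]> contents = readShards(shards, 4);
        System.out.println(contents.size()); // 4
    }
}
```

This only pays off if shards are balanced and their count stays well below the parquet file count, which is exactly the constraint noted above.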

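On the "do it once during REFRESH METADATA" point from the thread: the min/max check for single-valued partition columns could be precomputed and stored as a summary flag. The sketch below illustrates the idea only; the ColumnStats shape and method names are hypothetical, not Drill's actual classes:

```java
import java.util.List;

public class PartitionSummary {
    // Hypothetical per-file, per-column statistic: min and max values.
    static class ColumnStats {
        final long min, max;
        ColumnStats(long min, long max) { this.min = min; this.max = max; }
    }

    // One-time computation at REFRESH METADATA time: a column can be
    // treated as a partition column iff every file holds exactly one
    // value for it (min == max within each file). Storing this boolean
    // in the summary lets planning skip re-examining per-file min/max.
    static boolean singleValuedPerFile(List<ColumnStats> perFileStats) {
        for (ColumnStats s : perFileStats) {
            if (s.min != s.max) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<ColumnStats> partitionCol =
            List.of(new ColumnStats(3, 3), new ColumnStats(7, 7));
        List<ColumnStats> plainCol =
            List.of(new ColumnStats(1, 9), new ColumnStats(2, 2));
        System.out.println(singleValuedPerFile(partitionCol)); // true
        System.out.println(singleValuedPerFile(plainCol));     // false
    }
}
```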