Forgot to include the link for Jackson's AfterBurner module: https://github.com/FasterXML/jackson-module-afterburner
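Wiring it in is a one-line module registration. A minimal sketch, assuming a plain Jackson ObjectMapper (Drill's actual mapper configuration may differ):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class CacheMapperFactory {
    // Afterburner replaces Jackson's reflection-based property access
    // with generated bytecode accessors, which is where the claimed
    // 30-40% speedup comes from.
    public static ObjectMapper createMapper() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(new AfterburnerModule());
        return mapper;  // readValue() now uses the optimized code paths
    }
}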
On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:

> I was going to file an enhancement JIRA but thought I would discuss here
> first:
>
> The parquet metadata cache file is a JSON file that contains a subset of
> the metadata extracted from the parquet files. The cache file can get
> really large: a few GBs for a few hundred thousand files.
> I have filed a separate JIRA, DRILL-3973, for profiling the various
> aspects of planning, including metadata operations. In the meantime, the
> timestamps in the drillbit.log output indicate a large chunk of time spent
> in creating the drill table to begin with, which points to a bottleneck in
> reading the metadata. (I can provide performance numbers later once we
> confirm through profiling.)
>
> A few thoughts on improvements:
> - The Jackson deserialization of the JSON file is very slow. Can it be
> sped up? For instance, the AfterBurner module of Jackson claims to improve
> performance by 30-40% by avoiding the use of reflection.
> - The cache file read is a single-threaded process. If we were directly
> reading from the parquet files, we would use a default of 16 threads. What
> can be done to parallelize the read?
> - Is there any operation that could be done once during the REFRESH
> METADATA command? For instance, examining the min/max values to determine
> whether a partition column is single-valued could be eliminated if we did
> this computation during the REFRESH METADATA command and stored the
> summary one time.
> - A pertinent question: should the cache file be stored in a more
> efficient format such as Parquet instead of JSON?
>
> Aman
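On the parallel-read point: if the cache were split into one file per directory (a hypothetical layout; today it is a single file), the reads could be fanned out the same way the direct parquet footer reads are. A rough sketch with a 16-thread pool; DirectoryMetadata is a placeholder standing in for Drill's real metadata classes:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCacheReader {
    // Placeholder per-directory metadata type (hypothetical).
    static class DirectoryMetadata {
        public List<String> files = new ArrayList<>();
    }

    // Deserialize one cache file per directory on a fixed-size pool,
    // mirroring the 16-way default used for raw parquet reads.
    static List<DirectoryMetadata> readAll(List<File> cacheFiles) throws Exception {
        ObjectMapper mapper = new ObjectMapper();  // thread-safe once configured
        ExecutorService pool = Executors.newFixedThreadPool(16);
        try {
            List<Future<DirectoryMetadata>> futures = new ArrayList<>();
            for (File f : cacheFiles) {
                futures.add(pool.submit(() -> mapper.readValue(f, DirectoryMetadata.class)));
            }
            List<DirectoryMetadata> results = new ArrayList<>();
            for (Future<DirectoryMetadata> fut : futures) {
                results.add(fut.get());  // propagate any parse failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}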
