I was going to file an enhancement JIRA but thought I would discuss it here first:
The parquet metadata cache file is a JSON file that contains a subset of the metadata extracted from the parquet files. The cache file can get really large: a few GBs for a few hundred thousand files. I have filed a separate JIRA, DRILL-3973, for profiling the various aspects of planning, including metadata operations. In the meantime, the timestamps in the drillbit.log output indicate a large chunk of time spent in creating the drill table to begin with, which points to a bottleneck in reading the metadata. (I can provide performance numbers later, once we confirm through profiling.)

A few thoughts around improvements:

- The Jackson deserialization of the JSON file is very slow. Can it be sped up? For instance, the Afterburner module for Jackson claims to improve performance by 30-40% by avoiding the use of reflection.
- The cache file read is a single-threaded process, whereas reading directly from the parquet files uses a default of 16 threads. What can be done to parallelize the read? (A rough sketch covering this point and the previous one follows after the list.)
- Is there any work that could be done once, during the REFRESH METADATA command? For instance, examining the min/max values to determine whether a partition column is single-valued could be eliminated if we did this computation during REFRESH METADATA and stored the summary one time.
- A pertinent question: should the cache file be stored in a more efficient format, such as Parquet, instead of JSON?
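To make the first two ideas concrete, here is a minimal sketch, not actual Drill code. It assumes the cache were split into per-directory shard files so each thread can deserialize a shard independently, and it registers Jackson's Afterburner module on a shared ObjectMapper. MetadataCacheReaderSketch, readShards, and the generic Map payload are hypothetical placeholders for whatever POJOs the real cache file deserializes into.

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class MetadataCacheReaderSketch {

  // Same default parallelism we already use for reading parquet footers.
  private static final int NUM_THREADS = 16;

  public static List<Map<String, Object>> readShards(List<Path> shardFiles)
      throws Exception {
    // Afterburner replaces Jackson's reflection-based (de)serializers with
    // bytecode generated at runtime; the module claims a 30-40% speedup.
    ObjectMapper mapper = new ObjectMapper().registerModule(new AfterburnerModule());
    ExecutorService pool = Executors.newFixedThreadPool(NUM_THREADS);
    try {
      // One deserialization task per shard; ObjectMapper is thread-safe
      // once configured, so a single instance can be shared.
      List<Future<Map<String, Object>>> futures = new ArrayList<>();
      for (Path shard : shardFiles) {
        futures.add(pool.submit(() ->
            mapper.readValue(shard.toFile(),
                new TypeReference<Map<String, Object>>() {})));
      }
      List<Map<String, Object>> results = new ArrayList<>();
      for (Future<Map<String, Object>> f : futures) {
        results.add(f.get());  // propagates any deserialization failure
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}

Note the sketch presupposes sharding the cache (say, one file per directory), because a single monolithic JSON document is hard to split across threads; the alternative would be using Jackson's streaming API to hand off subtrees to workers. The Afterburner module lives in the jackson-module-afterburner artifact.

Aman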
