Forgot to include the link for Jackson's AfterBurner module: https://github.com/FasterXML/jackson-module-afterburner
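Wiring it in is a one-line module registration. A minimal sketch, assuming a plain Jackson ObjectMapper (Drill's actual mapper configuration may differ):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class CacheMapperFactory {
    // Afterburner replaces Jackson's reflection-based property access
    // with generated bytecode accessors, which is where the claimed
    // 30-40% speedup comes from.
    public static ObjectMapper createMapper() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(new AfterburnerModule());
        return mapper;  // readValue() now uses the optimized code paths
    }
}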
On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:

> I was going to file an enhancement JIRA but thought I would discuss here
> first:
>
> The parquet metadata cache file is a JSON file that contains a subset of
> the metadata extracted from the parquet files. The cache file can get
> really large: a few GBs for a few hundred thousand files.
> I have filed a separate JIRA, DRILL-3973, for profiling the various
> aspects of planning, including metadata operations. In the meantime, the
> timestamps in the drillbit.log output indicate a large chunk of time spent
> in creating the drill table to begin with, which points to a bottleneck in
> reading the metadata. (I can provide performance numbers later once we
> confirm through profiling.)
>
> A few thoughts on improvements:
> - The Jackson deserialization of the JSON file is very slow. Can it be
> sped up? For instance, the AfterBurner module of Jackson claims to improve
> performance by 30-40% by avoiding the use of reflection.
> - The cache file read is a single-threaded process. If we were directly
> reading from the parquet files, we would use a default of 16 threads. What
> can be done to parallelize the read?
> - Is there any operation that could be done once during the REFRESH
> METADATA command? For instance, examining the min/max values to determine
> whether a partition column is single-valued could be eliminated if we did
> this computation during the REFRESH METADATA command and stored the
> summary one time.
> - A pertinent question: should the cache file be stored in a more
> efficient format such as Parquet instead of JSON?
>
> Aman
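On the parallel-read point: if the cache were split into one file per directory (a hypothetical layout; today it is a single file), the reads could be fanned out the same way the direct parquet footer reads are. A rough sketch with a 16-thread pool; DirectoryMetadata is a placeholder standing in for Drill's real metadata classes:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCacheReader {
    // Placeholder per-directory metadata type (hypothetical).
    static class DirectoryMetadata {
        public List<String> files = new ArrayList<>();
    }

    // Deserialize one cache file per directory on a fixed-size pool,
    // mirroring the 16-way default used for raw parquet reads.
    static List<DirectoryMetadata> readAll(List<File> cacheFiles) throws Exception {
        ObjectMapper mapper = new ObjectMapper();  // thread-safe once configured
        ExecutorService pool = Executors.newFixedThreadPool(16);
        try {
            List<Future<DirectoryMetadata>> futures = new ArrayList<>();
            for (File f : cacheFiles) {
                futures.add(pool.submit(() -> mapper.readValue(f, DirectoryMetadata.class)));
            }
            List<DirectoryMetadata> results = new ArrayList<>();
            for (Future<DirectoryMetadata> fut : futures) {
                results.add(fut.get());  // propagate any parse failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}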
