I was going to file an enhancement JIRA but thought I would discuss it here first:
The parquet metadata cache file is a JSON file that contains a subset of the metadata extracted from the parquet files. The cache file can get really large: a few GBs for a few hundred thousand files. I have filed a separate JIRA, DRILL-3973, for profiling the various aspects of planning, including metadata operations. In the meantime, the timestamps in the drillbit.log output indicate a large chunk of time spent in creating the drill table to begin with, which points to a bottleneck in reading the metadata. (I can provide performance numbers later, once we confirm through profiling.)

A few thoughts around improvements:

- The Jackson deserialization of the JSON file is very slow. Can it be sped up? For instance, the Afterburner module for Jackson claims to improve performance by 30-40% by avoiding the use of reflection.
- The cache file read is a single-threaded process, whereas reading directly from the parquet files uses a default of 16 threads. What can be done to parallelize the read? (A rough sketch covering this point and the previous one follows after the list.)
- Is there any work that could be done once, during the REFRESH METADATA command? For instance, examining the min/max values to determine whether a partition column is single-valued could be eliminated if we did this computation during REFRESH METADATA and stored the summary one time.
- A pertinent question: should the cache file be stored in a more efficient format, such as Parquet, instead of JSON?
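To make the first two ideas concrete, here is a minimal sketch, not actual Drill code. It assumes the cache were split into per-directory shard files so each thread can deserialize a shard independently, and it registers Jackson's Afterburner module on a shared ObjectMapper. MetadataCacheReaderSketch, readShards, and the generic Map payload are hypothetical placeholders for whatever POJOs the real cache file deserializes into.

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class MetadataCacheReaderSketch {

  // Same default parallelism we already use for reading parquet footers.
  private static final int NUM_THREADS = 16;

  public static List<Map<String, Object>> readShards(List<Path> shardFiles)
      throws Exception {
    // Afterburner replaces Jackson's reflection-based (de)serializers with
    // bytecode generated at runtime; the module claims a 30-40% speedup.
    ObjectMapper mapper = new ObjectMapper().registerModule(new AfterburnerModule());
    ExecutorService pool = Executors.newFixedThreadPool(NUM_THREADS);
    try {
      // One deserialization task per shard; ObjectMapper is thread-safe
      // once configured, so a single instance can be shared.
      List<Future<Map<String, Object>>> futures = new ArrayList<>();
      for (Path shard : shardFiles) {
        futures.add(pool.submit(() ->
            mapper.readValue(shard.toFile(),
                new TypeReference<Map<String, Object>>() {})));
      }
      List<Map<String, Object>> results = new ArrayList<>();
      for (Future<Map<String, Object>> f : futures) {
        results.add(f.get());  // propagates any deserialization failure
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}

Note the sketch presupposes sharding the cache (say, one file per directory), because a single monolithic JSON document is hard to split across threads; the alternative would be using Jackson's streaming API to hand off subtrees to workers. The Afterburner module lives in the jackson-module-afterburner artifact.

Aman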
