rizaon commented on PR #4518:
URL: https://github.com/apache/iceberg/pull/4518#issuecomment-1095389667

   Hi @rdblue, thank you for your feedback.
   
   We found a slow query compilation issue against Iceberg tables in our 
recent Apache Impala build. Impala uses Iceberg's HiveCatalog and a HadoopFileIO 
instance with an S3A input stream to access data from S3. We ran a full 10 TB 
TPC-DS benchmark and found that query compilation can take several seconds, 
whereas it used to take less than a second with native Hive tables. The slowness 
in single-query compilation comes from having to call planFiles several 
times, even for scan nodes targeting the same table. We also see several 
socket read operations that spend hundreds of milliseconds during planFiles, 
presumably due to S3A HTTP HEAD request overhead and backward seek overhead 
(issue #4508). This especially hurts fast-running queries.
   
   We tried this caching solution and it sped up Impala query compilation on 
Iceberg tables by almost 5x compared to running without it. Our original 
solution, however, was to put a Caffeine cache as a singleton in AvroIO.java. I 
thought it would be better to supply the cache from outside.
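   
   To illustrate the "supply the cache from outside" idea, here is a minimal 
sketch. It is not the code in this PR: `CachingReader`, `lruCache`, and the 
S3A path are hypothetical names, and a small LRU map from the JDK stands in 
for Caffeine so the example has no dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class InjectedCacheSketch {

  /** Small LRU map used here as a stand-in for a Caffeine cache (assumption). */
  static <K, V> Map<K, V> lruCache(final int maxEntries) {
    // accessOrder = true makes iteration order least-recently-used first.
    return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
      }
    });
  }

  /** Hypothetical reader: the caller owns the cache and passes it in. */
  static final class CachingReader {
    private final Map<String, byte[]> cache;
    private final Function<String, byte[]> loader;

    CachingReader(Map<String, byte[]> cache, Function<String, byte[]> loader) {
      this.cache = cache;
      this.loader = loader;
    }

    byte[] read(String path) {
      // Invoke the (expensive) loader only when the path is not cached yet.
      return cache.computeIfAbsent(path, loader);
    }
  }

  public static void main(String[] args) {
    AtomicInteger loads = new AtomicInteger();
    Function<String, byte[]> loader = path -> {
      loads.incrementAndGet(); // counts simulated remote reads
      return path.getBytes(StandardCharsets.UTF_8);
    };

    // One cache, created by the engine, shared by two scans of the same table.
    Map<String, byte[]> shared = lruCache(100);
    CachingReader scanA = new CachingReader(shared, loader);
    CachingReader scanB = new CachingReader(shared, loader);

    scanA.read("s3a://bucket/metadata/manifest-1.avro");
    scanB.read("s3a://bucket/metadata/manifest-1.avro");
    System.out.println("loads = " + loads.get()); // prints "loads = 1"
  }
}
```

   With the cache injected, the caller decides its size and lifetime, and 
tests can pass in a fresh instance. A static singleton would fix those choices 
for every user of the class.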
   
   I have not considered the solution of adding a caching FileIO instance. I'm 
pretty new to the Iceberg codebase but am interested in following up on that if 
it can yield a better integration. Would it require a new class of 
Catalog/Table as well, or can we improve on the existing HiveCatalog & 
HadoopFileIO?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

