ConeyLiu opened a new pull request, #5632:
URL: https://github.com/apache/iceberg/pull/5632

   This patch aims to reduce Iceberg task planning time. We noticed that 
task planning does not benefit much from ParallelIterator when there are many 
manifest files to read. The problem is that ManifestReader reads each 
manifest file to get the PartitionSpec, which is unnecessary in most cases 
(the ManifestFile object already has the PartitionSpec ID, and the Table has the 
mapping from PartitionSpec ID to PartitionSpec). As a result, all of these 
manifest files are read serially while the following ParallelIterator code is 
initializing, which is very slow when HDFS is under heavy load. 
   
   
https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/util/ParallelIterable.java#L62
   
   After this change, task planning can be several times faster (depending on 
the number of driver threads and the number of manifest files).
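
   The difference between the two approaches can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this patch: `LazySpecSketch`, `Manifest`, `readSpecFromFile`, `eagerReaders`, and `lookupReaders` are hypothetical names, strings stand in for PartitionSpec objects, and a counter stands in for remote file reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative sketch only: none of these types are Iceberg classes.
class LazySpecSketch {
  // Hypothetical stand-in for ManifestFile: a path plus the partition spec ID.
  static final class Manifest {
    final String path;
    final int specId;

    Manifest(String path, int specId) {
      this.path = path;
      this.specId = specId;
    }
  }

  // Counts simulated file opens so the two strategies can be compared.
  static final AtomicInteger fileOpens = new AtomicInteger();

  // Simulates reading the spec from the manifest file itself (one remote read).
  static String readSpecFromFile(Manifest manifest) {
    fileOpens.incrementAndGet();
    return "spec-" + manifest.specId;
  }

  // Eager: building each reader triggers a file read, one after another,
  // on the calling thread, before any parallel work can start.
  static List<Supplier<String>> eagerReaders(List<Manifest> manifests) {
    List<Supplier<String>> readers = new ArrayList<>();
    for (Manifest manifest : manifests) {
      String spec = readSpecFromFile(manifest);
      readers.add(() -> spec);
    }
    return readers;
  }

  // Lookup: the spec is resolved from an in-memory map keyed by spec ID,
  // so building the readers performs no simulated I/O at all.
  static List<Supplier<String>> lookupReaders(
      List<Manifest> manifests, Map<Integer, String> specsById) {
    List<Supplier<String>> readers = new ArrayList<>();
    for (Manifest manifest : manifests) {
      readers.add(() -> specsById.get(manifest.specId));
    }
    return readers;
  }

  public static void main(String[] args) {
    List<Manifest> manifests = List.of(
        new Manifest("m1.avro", 0), new Manifest("m2.avro", 0), new Manifest("m3.avro", 1));

    fileOpens.set(0);
    eagerReaders(manifests);
    System.out.println("eager opens: " + fileOpens.get());

    fileOpens.set(0);
    lookupReaders(manifests, Map.of(0, "spec-0", 1, "spec-1"));
    System.out.println("lookup opens: " + fileOpens.get());
  }
}
```

   Running `main` prints `eager opens: 3` and then `lookup opens: 0`. In the eager variant the number of serial reads grows with the number of manifests, while the lookup variant leaves all file reads to the worker threads.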


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

