ConeyLiu opened a new pull request, #5632: URL: https://github.com/apache/iceberg/pull/5632
This patch aims to reduce Iceberg task planning time. We noticed that task planning does not benefit much from ParallelIterator when there are many manifest files to read. The problem is that ManifestReader reads each manifest file to get its PartitionSpec, which is unnecessary in most cases because the ManifestFile object already carries the PartitionSpec ID and the Table holds the mapping from PartitionSpec ID to PartitionSpec. As a result, all of these manifest files are read serially while the following ParallelIterator code is initializing, which is very slow when HDFS traffic is busy.

https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/util/ParallelIterable.java#L62

After this change, planning time can be reduced several times over, depending on the number of driver threads and the number of manifest files.
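
To make the idea concrete, here is a minimal, self-contained sketch of the two ways a spec can be resolved during setup. It does not use Iceberg classes: `ManifestRef`, `opensDuringSetup`, `resolveByOpening`, and `resolveById` are hypothetical stand-ins invented for illustration, a spec is represented as a plain string, and a counter increment stands in for a remote file read. It is not the code in this patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SpecLookupSketch {
  /** Hypothetical stand-in for a manifest list entry: a path plus the spec ID it records. */
  public static final class ManifestRef {
    public final String path;
    public final int specId;

    public ManifestRef(String path, int specId) {
      this.path = path;
      this.specId = specId;
    }
  }

  /** Counts simulated manifest opens performed serially on the calling thread. */
  public static final AtomicInteger opensDuringSetup = new AtomicInteger();

  /** Before (assumed behavior): every manifest is opened in turn just to learn its spec. */
  public static List<String> resolveByOpening(
      List<ManifestRef> manifests, Map<Integer, String> specsById) {
    List<String> specs = new ArrayList<>();
    for (ManifestRef manifest : manifests) {
      opensDuringSetup.incrementAndGet(); // stands in for one remote read of the file
      specs.add(specsById.get(manifest.specId));
    }
    return specs;
  }

  /** After (assumed behavior): the spec comes from the table's ID-to-spec map, with no file access. */
  public static List<String> resolveById(
      List<ManifestRef> manifests, Map<Integer, String> specsById) {
    List<String> specs = new ArrayList<>();
    for (ManifestRef manifest : manifests) {
      String spec = specsById.get(manifest.specId);
      if (spec == null) {
        throw new IllegalArgumentException("Unknown spec ID: " + manifest.specId);
      }
      specs.add(spec);
    }
    return specs;
  }

  public static void main(String[] args) {
    List<ManifestRef> manifests =
        List.of(new ManifestRef("m0.avro", 0), new ManifestRef("m1.avro", 1));
    Map<Integer, String> specsById = Map.of(0, "unpartitioned", 1, "day(ts)");

    resolveById(manifests, specsById);
    System.out.println("opens after lookup by ID: " + opensDuringSetup.get());

    resolveByOpening(manifests, specsById);
    System.out.println("opens after reading files: " + opensDuringSetup.get());
  }
}
```

In this sketch the lookup by ID leaves the serial setup step free of file reads, so the actual manifest reads can all be left to the worker threads.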
