RussellSpitzer opened a new issue #3532: URL: https://github.com/apache/iceberg/issues/3532
Currently we rely on several Spark internal classes when listing the contents of file-based tables for our migrate/add_files procedures. See https://github.com/apache/iceberg/blob/f5a753791f4dc6aca78569a14f731feda9edf462/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java#L810-L854

The cost of this operation scales directly with the number of files/folders in the table, regardless of the partition filter actually being applied. It may make sense to push down the filters used in the operation (in the case of add_files) or to do the listing in a more economical way. I don't have a good plan for this at the moment since our code is so reliant on Spark to perform the listing, but I assume we can do better.
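To illustrate the pushdown idea, here is a minimal, hypothetical sketch (not Iceberg or Spark code) of listing a Hive-style `key=value` partition layout while pruning non-matching partition directories *before* descending into them, so the listing cost scales with the matching partitions rather than the whole table. The class and method names are invented for this example, and it uses `java.nio.file` in place of a real `FileSystem` abstraction:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch: prune Hive-style partition directories (key=value)
// against an equality filter before listing their files.
public class PrunedListing {

  // Decide whether a directory name survives the filter. Names that are not
  // "key=value" partition directories are kept, since we cannot prune what
  // we cannot parse.
  static boolean matches(String dirName, Map<String, String> filter) {
    int eq = dirName.indexOf('=');
    if (eq < 0) {
      return true; // not a partition dir, do not prune
    }
    String key = dirName.substring(0, eq);
    String value = dirName.substring(eq + 1);
    String wanted = filter.get(key);
    return wanted == null || wanted.equals(value);
  }

  // Recursively list data files, skipping pruned partition directories
  // entirely instead of descending into them.
  static List<Path> listWithFilter(Path root, Map<String, String> filter) throws IOException {
    List<Path> files = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
      for (Path entry : stream) {
        if (Files.isDirectory(entry)) {
          if (matches(entry.getFileName().toString(), filter)) {
            files.addAll(listWithFilter(entry, filter));
          }
        } else {
          files.add(entry);
        }
      }
    }
    return files;
  }

  public static void main(String[] args) throws IOException {
    // Build a tiny two-partition layout: table/dt=.../data.parquet
    Path table = Files.createTempDirectory("table");
    for (String dt : new String[] {"2021-01-01", "2021-01-02"}) {
      Path part = Files.createDirectories(table.resolve("dt=" + dt));
      Files.createFile(part.resolve("data.parquet"));
    }
    // Only the matching partition's files are listed.
    List<Path> pruned = listWithFilter(table, Map.of("dt", "2021-01-01"));
    System.out.println(pruned.size());
  }
}
```

A real implementation would of course have to handle non-equality predicates, partition value escaping, and object-store listings, which is where the current reliance on Spark's internals comes from.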
