RussellSpitzer opened a new issue #3532: URL: https://github.com/apache/iceberg/issues/3532
Currently we rely on several Spark internal classes when listing the contents of file-based tables for our migrate/add_files procedures. See https://github.com/apache/iceberg/blob/f5a753791f4dc6aca78569a14f731feda9edf462/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java#L810-L854

The cost of this operation scales directly with the number of files/folders in the table, regardless of the partition filter actually being applied. It may make sense to push down the filters used in the operation (in the case of add_files) or to do the listing in a more economical way. I don't have a good plan for this at the moment since our code is so reliant on Spark to perform the listing, but I assume we can do better.
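To illustrate the pushdown idea, here is a minimal, hypothetical sketch (not Iceberg or Spark code) of listing a Hive-style `key=value` partition layout while pruning non-matching partition directories *before* descending into them, so the listing cost scales with the matching partitions rather than the whole table. The class and method names are invented for this example, and it uses `java.nio.file` in place of a real `FileSystem` abstraction:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch: prune Hive-style partition directories (key=value)
// against an equality filter before listing their files.
public class PrunedListing {

  // Decide whether a directory name survives the filter. Names that are not
  // "key=value" partition directories are kept, since we cannot prune what
  // we cannot parse.
  static boolean matches(String dirName, Map<String, String> filter) {
    int eq = dirName.indexOf('=');
    if (eq < 0) {
      return true; // not a partition dir, do not prune
    }
    String key = dirName.substring(0, eq);
    String value = dirName.substring(eq + 1);
    String wanted = filter.get(key);
    return wanted == null || wanted.equals(value);
  }

  // Recursively list data files, skipping pruned partition directories
  // entirely instead of descending into them.
  static List<Path> listWithFilter(Path root, Map<String, String> filter) throws IOException {
    List<Path> files = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
      for (Path entry : stream) {
        if (Files.isDirectory(entry)) {
          if (matches(entry.getFileName().toString(), filter)) {
            files.addAll(listWithFilter(entry, filter));
          }
        } else {
          files.add(entry);
        }
      }
    }
    return files;
  }

  public static void main(String[] args) throws IOException {
    // Build a tiny two-partition layout: table/dt=.../data.parquet
    Path table = Files.createTempDirectory("table");
    for (String dt : new String[] {"2021-01-01", "2021-01-02"}) {
      Path part = Files.createDirectories(table.resolve("dt=" + dt));
      Files.createFile(part.resolve("data.parquet"));
    }
    // Only the matching partition's files are listed.
    List<Path> pruned = listWithFilter(table, Map.of("dt", "2021-01-01"));
    System.out.println(pruned.size());
  }
}
```

A real implementation would of course have to handle non-equality predicates, partition value escaping, and object-store listings, which is where the current reliance on Spark's internals comes from.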
