Fokko commented on issue #7067: URL: https://github.com/apache/iceberg/issues/7067#issuecomment-1665449121
I agree with you there, but it happens after the filtering, so PyIceberg will already have pruned the unrelated files and filtered out the unrelated data. The problem with `to_pyarrow_dataset` is that Iceberg has much more sophisticated pruning that can happen at different levels, and this cannot be expressed in Arrow fragments. We're looking into adding Substrait integration for PyIceberg, where we could express this, but that is further down the road.

With [Iceberg's hidden partitioning](https://iceberg.apache.org/docs/latest/partitioning/), we don't have to do things like:

```
Use the `pyarrow_options` parameter to read only certain partitions.

>>> pl.scan_delta(  # doctest: +SKIP
...     table_path,
...     pyarrow_options={"partitions": [("year", "=", "2021")]},
... )
```

I think this is very user-unfriendly, because if you don't pass the partition, Polars will read too much data.

Since `pl.LazyFrame._scan_python_function(...)` will only be called on an action, and that call passes in the filter predicate, I think we're fine. Or am I missing something?
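As a rough illustration of the pruning idea discussed above, here is a minimal sketch in plain Python. It is not PyIceberg's or Polars' actual implementation: `DataFile`, `prune_files`, the `ts` column, and the file paths are all hypothetical, and real Iceberg pruning works on manifests and partition transforms at several levels. The sketch only shows how a predicate on the column itself can select files from per-file metadata, without the user naming a partition.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""

    path: str
    lower: date  # minimum value of the (assumed) `ts` column in this file
    upper: date  # maximum value of the (assumed) `ts` column in this file


def prune_files(files, start, end):
    """Keep only files whose [lower, upper] range overlaps [start, end)."""
    return [f for f in files if f.upper >= start and f.lower < end]


files = [
    DataFile("data/2020.parquet", date(2020, 1, 1), date(2020, 12, 31)),
    DataFile("data/2021.parquet", date(2021, 1, 1), date(2021, 12, 31)),
    DataFile("data/2022.parquet", date(2022, 1, 1), date(2022, 12, 31)),
]

# The filter is written against the column; no partition name is passed.
selected = prune_files(files, date(2021, 1, 1), date(2022, 1, 1))
print([f.path for f in selected])
```

If the filter is omitted here, nothing is silently misread; the scan simply considers every file, which is the behavior one would expect from a predicate-driven reader.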
