jonded94 commented on issue #7489: URL: https://github.com/apache/arrow-rs/issues/7489#issuecomment-2936182415
@etseidl @alamb it seems like the issue is fixed now! :)

Now that #7555 is merged, I checked out `arrow` and `parquet` at a recent git revision:

```
arrow = { git = "https://github.com/apache/arrow-rs.git", rev = "0ae9f66d10141c8d5054fd77f73168c7a2ea2819", features = ["pyarrow"] }
parquet = { git = "https://github.com/apache/arrow-rs.git", rev = "0ae9f66d10141c8d5054fd77f73168c7a2ea2819", features = ["async"] }
```

In the previously linked `pytest` test, I added cases where I truncate statistics to 7Mi (slightly below the threshold after which `pyarrow` crashes on reads, at least with `EnabledStatistics` == `PAGE`) and also to 8Mi (leading to `pyarrow` crashes).

Not only are file sizes now actually changing, but the `pyarrow` crashes are exactly as expected (i.e. only for page level statistics longer than 8MiB per value)! 😁

Here is an overview of file sizes & `pyarrow` crashes for the generated test files:

| EnabledStatistics | statistics_truncate_length | size [bytes] | fails with `pyarrow` |
| --- | --- | --- | --- |
| None | None | 787344 | No |
| Chunk | 1024 | 789415 | No |
| Chunk | 7Mi | 15467435 | No |
| Chunk | 8Mi | 17564587 | No |
| Chunk | None | 34341803 | No |
| Page | 1024 | 793552 | No |
| Page | 7Mi | 44827620 | No |
| Page | 8Mi | 51119076 | Yes |
| Page | None | 101450724 | Yes |

Judging by the file sizes, there really should be a `statistics_truncate_length` default of 1024 or lower.
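To put the numbers from the table in perspective, here is a small sketch that computes how much each configuration adds on top of the file written without statistics. The sizes are copied from the table; the `sizes` dictionary and the `overhead` helper are names made up for this illustration and are not part of `arrow-rs` or `pyarrow`.

```python
# File sizes in bytes, copied from the table in this comment.
# Keys are (EnabledStatistics, statistics_truncate_length).
sizes = {
    ("None", "None"): 787_344,
    ("Chunk", "1024"): 789_415,
    ("Chunk", "7Mi"): 15_467_435,
    ("Chunk", "8Mi"): 17_564_587,
    ("Chunk", "None"): 34_341_803,
    ("Page", "1024"): 793_552,
    ("Page", "7Mi"): 44_827_620,
    ("Page", "8Mi"): 51_119_076,
    ("Page", "None"): 101_450_724,
}

# The file written without any statistics serves as the baseline.
baseline = sizes[("None", "None")]


def overhead(key):
    """Return (extra bytes, size ratio) relative to the no-statistics file."""
    size = sizes[key]
    return size - baseline, size / baseline


for key in sizes:
    extra, ratio = overhead(key)
    print(f"{key[0]:>5} / {key[1]:>4}: +{extra:>11,} bytes ({ratio:6.2f}x)")
```

With a truncate length of 1024, statistics add 2,071 bytes (Chunk) or 6,208 bytes (Page), which is under 1% of the baseline, whereas untruncated Page statistics make the file roughly 129 times larger.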