jonded94 commented on issue #7489:
URL: https://github.com/apache/arrow-rs/issues/7489#issuecomment-2936182415
@etseidl @alamb it seems like the issue is fixed now! :) Now that #7555 is
merged, I pinned `arrow` and `parquet` to a recent git revision:
```
arrow = { git = "https://github.com/apache/arrow-rs.git", rev = "0ae9f66d10141c8d5054fd77f73168c7a2ea2819", features = ["pyarrow"] }
parquet = { git = "https://github.com/apache/arrow-rs.git", rev = "0ae9f66d10141c8d5054fd77f73168c7a2ea2819", features = ["async"] }
```
In the previously linked `pytest` test, I added cases where I truncate
statistics to 7 MiB (slightly below the threshold at which `pyarrow` crashes
on reads, at least with `EnabledStatistics` == `PAGE`) and to 8 MiB (which
leads to `pyarrow` crashes).
Not only do the file sizes now actually change, but the `pyarrow` crashes also
occur exactly where expected (i.e. only for page-level statistics of 8 MiB or
more per value)! 😁
Here is an overview of the file sizes and `pyarrow` crashes for the generated
test files:
| EnabledStatistics | statistics_truncate_length | size [bytes] | fails with `pyarrow` |
| --- | --- | --- | --- |
| None | None | 787344 | No |
| Chunk | 1024 | 789415 | No |
| Chunk | 7Mi | 15467435 | No |
| Chunk | 8Mi | 17564587 | No |
| Chunk | None | 34341803 | No |
| Page | 1024 | 793552 | No |
| Page | 7Mi | 44827620 | No |
| Page | 8Mi | 51119076 | Yes |
| Page | None | 101450724 | Yes |

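
To make the size differences easier to compare, here is a small sketch that computes how many bytes each configuration adds on top of the statistics-free baseline. The numbers are copied from the table; the names `BASELINE`, `sizes` and `overhead` are just illustrative helpers, not part of `arrow-rs` or the test suite.

```python
# File sizes in bytes, copied from the table above.
BASELINE = 787_344  # EnabledStatistics = None, no truncation setting

# (EnabledStatistics, statistics_truncate_length) -> file size in bytes
sizes = {
    ("Chunk", "1024"): 789_415,
    ("Chunk", "7Mi"): 15_467_435,
    ("Chunk", "8Mi"): 17_564_587,
    ("Chunk", "None"): 34_341_803,
    ("Page", "1024"): 793_552,
    ("Page", "7Mi"): 44_827_620,
    ("Page", "8Mi"): 51_119_076,
    ("Page", "None"): 101_450_724,
}


def overhead(size: int) -> int:
    """Bytes added compared to the file written without any statistics."""
    return size - BASELINE


for (level, truncate), size in sizes.items():
    ratio = size / BASELINE
    print(f"{level:<5} {truncate:>4}: +{overhead(size):>11,} bytes ({ratio:6.1f}x baseline)")
```

With a 1024-byte limit the statistics add only a few KiB, whereas untruncated page-level statistics make the file more than 100 times larger than the baseline.
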
The file sizes suggest that there really should be a
`statistics_truncate_length` default of 1024 or less.
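
For readers wondering why truncation is safe at all: a truncated minimum only has to be less than or equal to the real minimum, and a truncated maximum only has to be greater than or equal to the real maximum. Below is a minimal sketch of that idea for raw byte strings. It is an illustration under that assumption only, not the actual `parquet` crate code (which, for example, also has to respect UTF-8 boundaries for string columns), and `truncate_min` / `truncate_max` are hypothetical helper names.

```python
from typing import Optional


def truncate_min(value: bytes, length: int) -> bytes:
    """A prefix always sorts <= the full value, so it is a valid lower bound."""
    return value[:length]


def truncate_max(value: bytes, length: int) -> Optional[bytes]:
    """Return an upper bound of at most `length` bytes, or None if none exists."""
    if len(value) <= length:
        return value  # already short enough, nothing to do
    prefix = bytearray(value[:length])
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return None  # prefix was all 0xFF: no shorter upper bound exists
    prefix[-1] += 1  # bump the last byte so the result sorts above `value`
    return bytes(prefix)


value = b"abcdef"
lower = truncate_min(value, 3)
upper = truncate_max(value, 3)
print(lower, upper, lower <= value <= upper)
```

The bounds get looser as the limit shrinks, so readers may prune slightly less, but the stored statistics stay correct while their size is capped.
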