jonded94 commented on issue #7489:
URL: https://github.com/apache/arrow-rs/issues/7489#issuecomment-2936182415

   @etseidl @alamb it seems like the issue is fixed now! :) Now that #7555 is 
merged, I pinned `arrow` and `parquet` to a recent git revision:
   ```
   arrow = { git = "https://github.com/apache/arrow-rs.git", rev = "0ae9f66d10141c8d5054fd77f73168c7a2ea2819", features = ["pyarrow"] }
   parquet = { git = "https://github.com/apache/arrow-rs.git", rev = "0ae9f66d10141c8d5054fd77f73168c7a2ea2819", features = ["async"] }
   ```
   
   In the previously linked `pytest` test, I added cases where I truncate 
statistics to 7 MiB (slightly below the threshold after which `pyarrow` crashes 
on reads, at least with `EnabledStatistics::Page`) and also to 8 MiB (which 
makes `pyarrow` crash).
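
For readers unfamiliar with what truncating a statistic means: a truncated minimum must still be less than or equal to the real value, and a truncated maximum must still be greater than or equal to it. Below is a minimal sketch of that idea for raw byte values. The function names `truncate_min` and `truncate_max` are hypothetical and this is not taken from the `parquet` crate's implementation; it ignores UTF-8 character boundaries, which string columns would additionally need to respect.

```rust
/// Truncate a minimum statistic: any prefix compares less than or equal to
/// the full value in lexicographic byte order, so it stays a valid lower bound.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Truncate a maximum statistic: a bare prefix would be too small, so the
/// prefix is incremented to stay a valid upper bound. Returns `None` when no
/// shorter upper bound exists (the prefix is empty or consists only of 0xFF).
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        // Already short enough, nothing to do.
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    // Trailing 0xFF bytes cannot be incremented, so drop them first.
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            prefix.pop();
        } else {
            break;
        }
    }
    // Increment the last remaining byte; `?` returns None if nothing is left.
    let last = prefix.last_mut()?;
    *last += 1;
    Some(prefix)
}

fn main() {
    let value = b"parquet";
    let min = truncate_min(value, 3);
    let max = truncate_max(value, 3);
    println!("min bound: {:?}", String::from_utf8_lossy(&min));
    println!("max bound: {:?}", max.map(|m| String::from_utf8_lossy(&m).into_owned()));
}
```

With a small truncate length, each stored min/max value shrinks to at most that many bytes, which is why the setting has such a large effect on file size when the column values themselves are several MiB long.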
   
   Not only are file sizes now actually changing, but the `pyarrow` crashes are 
exactly as expected (i.e. only for page level statistics longer than 8MiB per 
value)! 😁 
   
   Here is an overview of file sizes & `pyarrow` crashes for the generated test 
files:
   
   | EnabledStatistics | statistics_truncate_length | size [bytes] | fails with `pyarrow` |
   | --- | --- | --- | --- |
   | None | None | 787344 | No |
   | Chunk | 1024 | 789415 | No |
   | Chunk | 7 MiB | 15467435 | No |
   | Chunk | 8 MiB | 17564587 | No |
   | Chunk | None | 34341803 | No |
   | Page | 1024 | 793552 | No |
   | Page | 7 MiB | 44827620 | No |
   | Page | 8 MiB | 51119076 | Yes |
   | Page | None | 101450724 | Yes |
   
   The file sizes suggest that there really should be a default 
`statistics_truncate_length` of 1024 or less.
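
To make the size argument concrete, here is a small sketch that expresses each file size from the table above as a multiple of the file written without statistics. The sizes are copied from the table; the helper name `overhead_factor` is just illustrative.

```rust
/// (statistics level, truncate length, file size in bytes), copied from the table.
const RESULTS: [(&str, &str, u64); 9] = [
    ("None", "None", 787_344),
    ("Chunk", "1024", 789_415),
    ("Chunk", "7 MiB", 15_467_435),
    ("Chunk", "8 MiB", 17_564_587),
    ("Chunk", "None", 34_341_803),
    ("Page", "1024", 793_552),
    ("Page", "7 MiB", 44_827_620),
    ("Page", "8 MiB", 51_119_076),
    ("Page", "None", 101_450_724),
];

/// File size relative to the baseline file without statistics.
fn overhead_factor(size: u64, baseline: u64) -> f64 {
    size as f64 / baseline as f64
}

fn main() {
    // The first row (no statistics at all) serves as the baseline.
    let baseline = RESULTS[0].2;
    for &(level, truncate, size) in RESULTS.iter() {
        println!(
            "{:<6} {:<6} {:>10} bytes  {:.2}x",
            level,
            truncate,
            size,
            overhead_factor(size, baseline)
        );
    }
}
```

With a truncate length of 1024, even page-level statistics add less than 1% to the file, whereas untruncated page-level statistics make this test file more than 100 times larger than the baseline.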

