shaunak-pusalkar commented on issue #46391:
URL: https://github.com/apache/arrow/issues/46391#issuecomment-3522671301

   Here’s an update on this issue after a series of tests we ran on the full 
dataset.
   
   Summary of what we tried:
   
   - Loaded the parquet-go–written files fully into memory using PyArrow.
   - Converted all unsigned integer columns to signed int64.
   - Retained floating-point columns (e.g., `certainty_score`) without conversion, to avoid data loss.
   - Rewrote the Parquet files with:
     - `compression="gzip"`
     - `version="2.6"`
     - `write_statistics=True`
     - `coerce_timestamps="ms"`
     - `use_deprecated_int96_timestamps=False`
   
   
   Result:
   After rewriting, the filtering issue with `<` completely disappeared.
   The rewritten datasets behave correctly under `open_dataset()` and match the results of `read_parquet()`.
   
   Interpretation:
   This strongly suggests that the root cause was incorrect or incompatible 
min/max statistics in parquet-go’s unsigned integer encodings.
   Rewriting forces PyArrow to regenerate correct statistics and encoding 
metadata, resolving the filter pushdown error.
   
   
   Thanks @mapleFU and @thisisnic and @jameshowison for earlier suggestions.

