shaunak-pusalkar commented on issue #46391: URL: https://github.com/apache/arrow/issues/46391#issuecomment-3522671301
Here's an update on this issue after a series of tests we ran on the full dataset.

**What we tried:**
- Loaded the parquet-go–written files fully into memory using PyArrow.
- Converted all unsigned integer columns to signed int64.
- Retained floating-point columns (e.g., `certainty_score`) without conversion, to avoid data loss.
- Rewrote the Parquet files with `compression="gzip"`, `version="2.6"`, `write_statistics=True`, `coerce_timestamps="ms"`, and `use_deprecated_int96_timestamps=False`.

**Result:** After rewriting, the filtering issue with `<` completely disappeared. The rewritten datasets behave correctly under `open_dataset()` and match the results of `read_parquet()`.

**Interpretation:** This strongly suggests that the root cause was incorrect or incompatible min/max statistics in parquet-go's unsigned integer encodings. Rewriting forces PyArrow to regenerate correct statistics and encoding metadata, which resolves the filter pushdown error.

Thanks @mapleFU, @thisisnic, and @jameshowison for the earlier suggestions.
