shaunak-pusalkar commented on issue #46391: URL: https://github.com/apache/arrow/issues/46391#issuecomment-3522671301
Here's an update on this issue after a series of tests we ran on the full dataset.

**What we tried:**
- Loaded the parquet-go–written files fully into memory using PyArrow.
- Converted all unsigned integer columns to signed int64.
- Retained floating-point columns (e.g., `certainty_score`) without conversion, to avoid data loss.
- Rewrote the Parquet files with `compression="gzip"`, `version="2.6"`, `write_statistics=True`, `coerce_timestamps="ms"`, and `use_deprecated_int96_timestamps=False`.

**Result:** After rewriting, the filtering issue with `<` completely disappeared. The rewritten datasets behave correctly under `open_dataset()` and match the results of `read_parquet()`.

**Interpretation:** This strongly suggests that the root cause was incorrect or incompatible min/max statistics in parquet-go's unsigned integer encodings. Rewriting forces PyArrow to regenerate correct statistics and encoding metadata, which resolves the filter pushdown error.

Thanks @mapleFU, @thisisnic, and @jameshowison for the earlier suggestions.
