samuelcolvin commented on issue #6310:
URL: https://github.com/apache/arrow-rs/issues/6310#issuecomment-2312724126

   > my guess is that the problem comes because either num_buffered_rows or 
num_page_nulls is wrong in the last page, hence
   
   Okay, ignore that suggestion. I've done some more digging and made a bit of 
progress. The key point from above is:
   
   ```
       definition_level_histograms: Some(
           [
               ...
               7677,
               30,
           ],
   ```
   
   I think this is saying that the last page has 7677 null values (which 
matches `null_counts`), and 30 non-null values.
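
As a rough sketch of that reading, assuming `process_pid` is a flat optional column (maximum definition level 1, where level 0 means null and level 1 means a value is present) and that the list holds one histogram per page, concatenated in page order. The helper names below are hypothetical and are not part of the `parquet` crate:

```rust
/// Count how many values in a single page carry each definition level.
/// Index `i` of the result is the number of values at level `i`.
/// Assumes every level is in `0..=max_def_level`; anything else panics.
fn definition_level_histogram(def_levels: &[i16], max_def_level: i16) -> Vec<i64> {
    let mut histogram = vec![0i64; max_def_level as usize + 1];
    for &level in def_levels {
        histogram[level as usize] += 1;
    }
    histogram
}

/// Slice the last page's histogram out of the concatenated per-page list.
/// Returns `None` if the list is empty or not a whole number of histograms.
fn last_page_histogram(concatenated: &[i64], max_def_level: i16) -> Option<&[i64]> {
    let width = max_def_level as usize + 1;
    if concatenated.is_empty() || concatenated.len() % width != 0 {
        return None;
    }
    Some(&concatenated[concatenated.len() - width..])
}
```

Under those assumptions, a page with 7677 nulls followed by 30 values yields the histogram `[7677, 30]`, which is the pair at the end of the list above.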
   
   Sure enough, if I run `select count(*) from 'bad.parquet' where process_pid 
is not null;` on the parquet file (`process_pid` is the problematic column), I 
get the result `30`! All 30 non-null values are `1`.
   
   I guess the next step is to build a parquet file with a `UInt32` column 
that's mostly null except for one page, and see if we can reproduce the problem.
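
A minimal sketch of the input data for such a reproduction, using only the standard library. The function name, the number of all-null pages, and the rows per page are hypothetical. Placing the 30 non-null values at the very end of the column is also an assumption, since their position in `bad.parquet` is not known from the output above. The writer decides the real page boundaries, so the layout here only approximates the intended one:

```rust
/// Build the values for a mostly-null `UInt32` column:
/// `null_pages * rows_per_page` nulls, then `nulls_in_last` more nulls,
/// then `non_null_in_last` copies of `value` at the very end.
fn build_mostly_null_column(
    null_pages: usize,
    rows_per_page: usize,
    nulls_in_last: usize,
    non_null_in_last: usize,
    value: u32,
) -> Vec<Option<u32>> {
    // Leading pages that contain nothing but nulls.
    let mut column: Vec<Option<u32>> = vec![None; null_pages * rows_per_page];
    // The final page: mostly nulls, then a short run of real values.
    column.extend(std::iter::repeat(None).take(nulls_in_last));
    column.extend(std::iter::repeat(Some(value)).take(non_null_in_last));
    column
}
```

For example, `build_mostly_null_column(3, 1000, 7677, 30, 1)` mirrors the counts seen in the last page. The resulting `Vec<Option<u32>>` still has to be turned into an Arrow array and written out with the Parquet writer, which is not shown here.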


