punkeel commented on issue #10243:
URL: https://github.com/apache/arrow-rs/issues/10243#issuecomment-4855241417

   This lines up with what I see on the original production file too.
   
   I instrumented the same path locally against the original bad Parquet file. 
I still can’t share that file because it is ~90MiB and contains PII, but the 
debug output confirms this is not just an artifact of the minimized repro. 👍 
(That is, the file was generated by a perfectly normal Go Parquet program, 
with no tampering with header/trailer values.)
   
   On the original file, the instrumented reader shows the same pattern you 
identified:
   
   ```text
   [probe] at_record_boundary: next_page num_rows=Some(148) num_levels=Some(620000) is_dict=false -> returns true
   [probe] DataPageV2 ... num_rows=148 num_values=620000 rep_levels_byte_len=658 first_rep_levels=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] levels_read=620000
   ```
   
   
   So in the real file too, `at_record_boundary` returns `true` because the 
next page is DataPageV2, but that next page starts with repetition level 1, 
meaning it is continuing the list from the previous page.
   I saw this many times in the original file. So the tiny repro appears to be 
faithfully capturing the same real-world shape, not creating a weird 
synthetic-only case. 🎉 
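
   For illustration only, here is a minimal sketch of the property the probe output is checking. It assumes the decoded repetition levels of a page are available as a slice of `i16`. Both helper names are hypothetical and are not part of the `parquet` crate or its reader code. It relies only on the Parquet rule that repetition level 0 marks the first value of a new record.

```rust
/// Hypothetical helper (not an arrow-rs / `parquet` crate API): reports
/// whether a page begins a new record, judged by its first repetition level.
/// Returns `None` when the page carries no repetition levels to inspect.
fn page_starts_new_record(rep_levels: &[i16]) -> Option<bool> {
    // Repetition level 0 marks the first value of a new record; any higher
    // level continues a list that was opened earlier.
    rep_levels.first().map(|&level| level == 0)
}

/// Hypothetical helper: counts how many records begin inside this page,
/// i.e. how many repetition levels are equal to 0.
fn records_started_in_page(rep_levels: &[i16]) -> usize {
    rep_levels.iter().filter(|&&level| level == 0).count()
}
```

   Applied to the `first_rep_levels` shown above (all 1s), `page_starts_new_record` would return `Some(false)`: the page continues a list from the previous page even though it is a DataPageV2.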
   
   Happy to run additional instrumentation against the original file if that 
would help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
