punkeel commented on issue #10243: URL: https://github.com/apache/arrow-rs/issues/10243#issuecomment-4855241417
This lines up with what I see on the original production file too.

I instrumented the same path locally against the original bad Parquet file. I still can't share that file because it is ~90MiB and contains PII, but the debug output confirms this is not just an artifact of the minimized repro. (That is, the file was generated by a perfectly normal Parquet Go program that does not touch header/trailer values.)

On the original file, the instrumented reader shows the same pattern you identified:

```text
[probe] at_record_boundary: next_page num_rows=Some(148) num_levels=Some(620000) is_dict=false -> returns true
[probe] DataPageV2 ... num_rows=148 num_values=620000 rep_levels_byte_len=658 first_rep_levels=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] levels_read=620000
```

So in the real file too, `at_record_boundary` returns `true` because the next page is DataPageV2, but that next page starts with repetition level 1, meaning it continues the list from the previous page. I saw this many times in the original file.

So the tiny repro appears to faithfully capture the same real-world shape rather than creating a synthetic-only case.

Happy to run additional instrumentation against the original file if that would help.
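
To make the condition concrete, here is a minimal sketch of the check that the probe output implies. It relies only on the Parquet rule that a repetition level of 0 marks the start of a new record. The function names `page_starts_new_record` and `records_started_in_page` are hypothetical helpers for illustration, not part of the arrow-rs API, and this is not a proposed fix.

```rust
/// Hypothetical helper (not an arrow-rs API).
/// Returns true only if the first decoded repetition level of a page is 0,
/// i.e. the page begins a new record instead of continuing the previous one.
/// An empty slice is treated as "not known to start a record".
fn page_starts_new_record(rep_levels: &[i16]) -> bool {
    rep_levels.first().copied() == Some(0)
}

/// Hypothetical helper (not an arrow-rs API).
/// Counts the records that begin inside a page: every repetition level
/// equal to 0 marks the first value of a new record.
fn records_started_in_page(rep_levels: &[i16]) -> usize {
    rep_levels.iter().filter(|&&level| level == 0).count()
}
```

Applied to the `first_rep_levels` values from the probe output (all 1), `page_starts_new_record` returns `false`, which is the opposite of what `at_record_boundary` reported for that page.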
