[GitHub] [arrow-rs] etseidl opened a new issue, #3680: Should Parquet pages begin on the start of a row?

via GitHub Wed, 08 Feb 2023 17:55:40 -0800


etseidl opened a new issue, #3680:
URL: https://github.com/apache/arrow-rs/issues/3680

**Which part is this question about**
Parquet writer

**Describe your question**
In #1777 it was brought up
[here](https://github.com/apache/arrow-rs/issues/1777#issuecomment-1147686956)
that the Parquet spec seems to require that pages begin on record boundaries
when writing offset indices. Additionally, the same can be said for V2 page
headers (see comment in the parquet-format
[thrift](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L564)
file). It appears that this reasoning was rejected, and the Parquet writer
continues to write files where rows can span multiple pages. I'm wondering if
this should still be considered a bug given how difficult finding individual
rows is made with this behavior in place.

**Additional context**
I've been working with the cuDF Parquet reader, and files with large nested
rows can create havoc when rows span pages. Parquet-mr appears to hew to the
"pages start with R=0" rule, i.e. the first value in every page has a
repetition level of 0.
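
To make the "R=0" condition concrete, here is a minimal sketch of how one could check whether a set of page boundaries splits any row. It is not arrow-rs code: `pages_splitting_rows`, `rep_levels`, and `page_value_counts` are hypothetical names for illustration, and it assumes the repetition levels of one column chunk are already decoded into a slice.

```rust
/// Returns the indices of pages whose first value does NOT start a new row.
///
/// `rep_levels`: repetition levels of one column chunk, in write order
/// (hypothetical input, assumed already decoded).
/// `page_value_counts`: number of values (levels) stored in each data page.
fn pages_splitting_rows(rep_levels: &[i16], page_value_counts: &[usize]) -> Vec<usize> {
    let mut offending = Vec::new();
    let mut offset = 0usize;
    for (page_idx, &count) in page_value_counts.iter().enumerate() {
        if count > 0 {
            // A repetition level of 0 marks the first value of a new
            // top-level row; anything else continues the previous row.
            if let Some(&first) = rep_levels.get(offset) {
                if first != 0 {
                    offending.push(page_idx);
                }
            }
        }
        offset += count;
    }
    offending
}
```

For example, the rows `[[a, b, c], [d], [e, f]]` in a single list column have repetition levels `0, 1, 1, 0, 0, 1`. Splitting them into pages of 2 and 4 values makes the second page start mid-row, while pages of 3 and 3 values keep every page on a row boundary. Flat (non-repeated) columns have no repetition levels, so every value starts a row and the question only arises for nested data.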

