etseidl opened a new issue, #3680: URL: https://github.com/apache/arrow-rs/issues/3680
**Which part is this question about** Parquet writer <!-- Is it code base, library api, documentation or some other part? --> **Describe your question** In #1777 it was brought up [here](https://github.com/apache/arrow-rs/issues/1777#issuecomment-1147686956) that the Parquet spec seems to require that pages begin on record boundaries when writing offset indices. Additionally, the same can be said for V2 page headers (see comment in the parquet-format [thrift](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L564) file). It appears that this reasoning was rejected, and the Parquet writer continues to write files where rows can span multiple pages. I'm wondering if this should still be considered a bug given how difficult finding individual rows is made with this behavior in place. <!-- A clear and concise description of what the question is. --> **Additional context** I've been working with the cuDF Parquet reader, and files with large nested rows can create havoc when rows span pages. Parquet-mr appears to hew to the "pages start with R=0" rule. <!-- Add any other context about the problem here. --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
