etseidl opened a new issue, #3680:
URL: https://github.com/apache/arrow-rs/issues/3680

   **Which part is this question about**
   Parquet writer
   
   **Describe your question**
   In #1777 it was pointed out 
[here](https://github.com/apache/arrow-rs/issues/1777#issuecomment-1147686956) 
that the Parquet spec seems to require that pages begin on record boundaries 
when writing offset indices. The same appears to hold for V2 page 
headers (see the comment in the parquet-format 
[thrift](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L564)
 file). It appears that this reasoning was rejected, and the Parquet writer 
continues to write files in which a row can span multiple pages. I'm wondering 
whether this should still be considered a bug, given how difficult this 
behavior makes it to locate individual rows.
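
To make the rule concrete, here is a minimal sketch of the check I have in 
mind. It assumes the repetition levels of each data page have already been 
decoded into a `Vec<i16>`; the function name and the input shape are 
hypothetical and not part of the arrow-rs API.

```rust
/// Returns the indices of pages that do not begin on a record boundary.
///
/// `pages` holds, for each data page of one column chunk, the decoded
/// repetition levels of that page (hypothetical input; decoding is not shown).
fn pages_starting_mid_record(pages: &[Vec<i16>]) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        // A repetition level of 0 marks the first value of a new record.
        // An empty page has no first value, so it is not flagged.
        .filter(|(_, levels)| levels.first().map_or(false, |&r| r != 0))
        .map(|(i, _)| i)
        .collect()
}
```

A file that follows the "pages start with R=0" rule would yield an empty 
result for every column chunk. Any index returned identifies a page whose 
first value continues a row that started on an earlier page.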
   
   **Additional context**
   I've been working with the cuDF Parquet reader, and files with large nested 
rows can wreak havoc when rows span pages. Parquet-mr appears to hew to the 
"pages start with R=0" rule.
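
For illustration, a writer that follows this rule could defer each page cut 
until the next value starts a new record. The sketch below is only my 
assumption of the general idea. It is not the actual parquet-mr or arrow-rs 
logic, and `split_on_record_boundaries` and `min_values` are hypothetical 
names.

```rust
/// Splits a column's repetition levels into pages of at least `min_values`
/// values, cutting only where the next value has repetition level 0.
/// Hypothetical helper for illustration; real writers size pages in bytes.
fn split_on_record_boundaries(rep_levels: &[i16], min_values: usize) -> Vec<Vec<i16>> {
    let mut pages = Vec::new();
    let mut current: Vec<i16> = Vec::new();
    for &r in rep_levels {
        // Close the page only when it is full enough AND the incoming value
        // starts a new record, so no record is split across pages.
        if r == 0 && !current.is_empty() && current.len() >= min_values {
            pages.push(std::mem::take(&mut current));
        }
        current.push(r);
    }
    // Flush whatever remains as the final page.
    if !current.is_empty() {
        pages.push(current);
    }
    pages
}
```

The trade-off is that a page may grow beyond its target size when a single 
row is very large, since the cut has to wait for the next `R=0` value.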
   

