wgtmac commented on PR #34054:
URL: https://github.com/apache/arrow/pull/34054#issuecomment-1426589075

   > @wgtmac another issue to consider as you implement the page index is rows
   > that span multiple pages. With nested columns, it is possible to have single
   > rows that are so large that they exceed the requested page size. arrow-cpp
   > currently will honor the page size by splitting these rows across multiple
   > pages. The current parquet spec, however, seems to require that pages begin
   > at row boundaries (i.e. the repetition level R is 0 for the first value in
   > each page, see
   > [here](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L564)
   > and
   > [here](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L918)).
   > Do you concur and think this should be another blocking issue or part of
   > this PR?
   
   Thanks for the information, @etseidl.
   
   Yes, I am aware that a record may span multiple pages. In parquet-cpp,
however, the page size check always happens at the end of each batch, so a
page is guaranteed not to split any record. For reference, please check this
function and the places where it is called:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376
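
   As a rough illustration of the idea (a toy model, not the actual parquet-cpp
code), the sketch below checks the page size only after a whole batch has been
buffered. `ToyPageWriter`, `page_size_limit`, `write_batch`, and `flush_page`
are hypothetical names, the limit is counted in values rather than bytes, and
the sketch assumes that every batch contains only whole records, so each batch
starts at repetition level 0.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ToyPageWriter:
    """Toy model of a column writer (hypothetical, not parquet-cpp)."""

    page_size_limit: int  # hypothetical limit, counted in buffered values
    buffered: List[int] = field(default_factory=list)  # rep levels of the open page
    pages: List[List[int]] = field(default_factory=list)  # finished pages

    def write_batch(self, rep_levels: Sequence[int]) -> None:
        # Assumption: the batch holds whole records only, so rep_levels[0] == 0.
        self.buffered.extend(rep_levels)
        # The size check runs only here, after the whole batch is buffered,
        # so a page boundary can never fall inside a batch.
        if len(self.buffered) >= self.page_size_limit:
            self.flush_page()

    def flush_page(self) -> None:
        # Close the open page, if it holds any values.
        if self.buffered:
            self.pages.append(self.buffered)
            self.buffered = []


# Three records, one per batch; the second is larger than the page limit.
writer = ToyPageWriter(page_size_limit=4)
for record in ([0, 1, 1], [0, 1, 1, 1, 1, 1], [0]):
    writer.write_batch(record)
writer.flush_page()  # flush whatever is left at the end of the column chunk

# Every page starts at a record boundary (repetition level 0).
print([page[0] for page in writer.pages])
```

   In this model the first page grows past the limit instead of cutting the
large record in two, which is the trade-off described above: the page size is
a soft target, and record boundaries take priority.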
 

