wgtmac commented on PR #34054: URL: https://github.com/apache/arrow/pull/34054#issuecomment-1426589075
> @wgtmac another issue to consider as you implement the page index is rows that span multiple pages. With nested columns, it is possible to have single rows that are so large that they exceed the requested page size. arrow-cpp currently will honor the page size by splitting these rows across multiple pages. The current parquet spec, however, seems to require that pages begin at row boundaries (i.e. the repetition level R is 0 for the first value in each page, see [here](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L564) and [here](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L918)). Do you concur and think this should be another blocking issue or part of this PR?

Thanks for the information, @etseidl. Yes, I have already noticed that a record may span multiple pages. In parquet-cpp, however, the page size check always happens at the end of each batch, which guarantees that a page will not split a record. Please check this function, as well as where it is called, for reference: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376
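
To illustrate the argument about checking the page size only at the end of a batch, here is a minimal toy sketch. It is not the parquet-cpp API: `ToyColumnWriter`, `Page`, and `AllPagesStartAtRecordBoundary` are hypothetical names, and the page size is counted in values rather than bytes. It also assumes that every batch handed to the writer contains only whole records, so that each batch starts with repetition level 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model of a column writer; not the parquet-cpp classes.
struct Page {
  std::vector<int16_t> rep_levels;  // repetition level of each value in the page
};

class ToyColumnWriter {
 public:
  // The limit is counted in values here; a real writer would count bytes.
  explicit ToyColumnWriter(std::size_t page_size_limit)
      : page_size_limit_(page_size_limit) {}

  // Assumption: each batch holds whole records only, so rep_levels[0] == 0.
  void WriteBatch(const std::vector<int16_t>& rep_levels) {
    buffered_.insert(buffered_.end(), rep_levels.begin(), rep_levels.end());
    // The size check runs only after the whole batch is buffered,
    // so a page is never cut in the middle of a batch.
    if (buffered_.size() >= page_size_limit_) {
      FlushPage();
    }
  }

  // Emit whatever is still buffered as a final page.
  void Close() {
    if (!buffered_.empty()) {
      FlushPage();
    }
  }

  const std::vector<Page>& pages() const { return pages_; }

 private:
  void FlushPage() {
    pages_.push_back(Page{buffered_});
    buffered_.clear();
  }

  std::size_t page_size_limit_;
  std::vector<int16_t> buffered_;
  std::vector<Page> pages_;
};

// True if the first value of every page has repetition level 0.
inline bool AllPagesStartAtRecordBoundary(const std::vector<Page>& pages) {
  for (const Page& page : pages) {
    if (page.rep_levels.empty() || page.rep_levels.front() != 0) {
      return false;
    }
  }
  return true;
}
```

In this model a page may grow past the limit when a single record is large, which is the trade-off for never splitting a record. The guarantee only holds under the stated assumption that batches end on record boundaries.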
