etseidl commented on PR #34054: URL: https://github.com/apache/arrow/pull/34054#issuecomment-1426607039
> Yes, I have already noticed that a record may span across different pages. But in parquet-cpp, the page size check always happens at the end of each batch. Therefore it guarantees that a page will not split any record. Please check this function as well as where it is called for reference: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376

Perhaps I'm misunderstanding, but it appears that the function you referenced is called after a batch of values is written. I don't see where it is guaranteed that the end of a batch is also the end of a row.

But thanks for working on the page indexes, I think it's an important feature that arrow-cpp currently lacks.
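
To make the concern concrete, here is a minimal sketch of what "the end of a batch is also the end of a row" would mean in terms of repetition levels. In Parquet, a repetition level of 0 marks the first value of a new record, so a batch boundary only coincides with a row boundary if the next value's repetition level is 0. The helper names `BatchEndsOnRowBoundary` and `LastRowBoundary` are hypothetical and are not part of parquet-cpp; this is an illustration of the check, not a description of what `column_writer.cc` does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a parquet-cpp API): returns true when a batch that
// ends just before index `batch_end` also ends on a row boundary.
// A repetition level of 0 marks the first value of a new record.
bool BatchEndsOnRowBoundary(const std::vector<int16_t>& rep_levels,
                            std::size_t batch_end) {
  assert(batch_end <= rep_levels.size());
  // The end of the column data is trivially a row boundary.
  if (batch_end == rep_levels.size()) return true;
  return rep_levels[batch_end] == 0;
}

// Hypothetical helper: move a proposed batch end back to the closest row
// boundary at or before it, so that a page flushed there splits no record.
std::size_t LastRowBoundary(const std::vector<int16_t>& rep_levels,
                            std::size_t batch_end) {
  assert(batch_end <= rep_levels.size());
  std::size_t pos = batch_end;
  while (pos > 0 && pos < rep_levels.size() && rep_levels[pos] != 0) {
    --pos;
  }
  return pos;
}
```

With repetition levels `{0, 1, 1, 0, 1, 0}` there are three rows (values 0-2, 3-4, and 5). A batch ending before index 2 would cut the first row in half, whereas one ending before index 3 would not. Unless the writer performs a check along these lines, or the caller only ever hands it whole rows, a page size check at the end of a batch does not by itself rule out a split record.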
