wgtmac commented on PR #34054:
URL: https://github.com/apache/arrow/pull/34054#issuecomment-1426616058

   > > Yes, I have already noticed that a record may span across different
   > > pages. But in parquet-cpp, the page size check always happens at the
   > > end of each batch. Therefore it guarantees that a page will not split
   > > any record. Please check this function, as well as where it is called,
   > > for reference:
   > > https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376
   >
   > Perhaps I'm misunderstanding, but it appears that the function you
   > referenced is called after a batch of values is written. I don't see
   > where it is guaranteed that the end of a batch is also the end of a row.
   > But thanks for working on the page indexes; I think it's an important
   > feature that arrow-cpp currently lacks.
   
   Please correct me if I am wrong. At least the Arrow Parquet writer
   guarantees that each batch ends on a row boundary, because it calls
   `ColumnWriter::WriteArrow` like this:
   https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L154
   You are right, though, that the ParquetFileWriter itself does not enforce
   this.
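
   To make the argument concrete, here is a minimal toy model of it. This is
   only a sketch, not the parquet-cpp code: `ToyColumnWriter`, `WritePages`
   and `EveryPageStartsOnRecordBoundary` are hypothetical names, and the page
   size is counted in values rather than bytes. It assumes that a repetition
   level of 0 marks the first value of a record and that the size check runs
   only once a whole batch has been buffered.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model, NOT the parquet-cpp API.
// Each value is represented only by its repetition level;
// a repetition level of 0 marks the first value of a new record.
struct ToyColumnWriter {
  std::size_t page_size_limit;              // flush threshold, counted in values
  std::vector<int16_t> buffered;            // levels buffered for the current page
  std::vector<std::vector<int16_t>> pages;  // pages emitted so far

  explicit ToyColumnWriter(std::size_t limit) : page_size_limit(limit) {}

  void WriteBatch(const std::vector<int16_t>& rep_levels) {
    buffered.insert(buffered.end(), rep_levels.begin(), rep_levels.end());
    // The size check runs only here, after the whole batch is buffered,
    // so a page boundary can only ever coincide with a batch boundary.
    if (buffered.size() >= page_size_limit) FlushPage();
  }

  void FlushPage() {
    if (buffered.empty()) return;
    pages.push_back(buffered);
    buffered.clear();
  }

  void Close() { FlushPage(); }
};

// Writes all batches through a fresh toy writer and returns the pages.
std::vector<std::vector<int16_t>> WritePages(
    const std::vector<std::vector<int16_t>>& batches, std::size_t limit) {
  ToyColumnWriter writer(limit);
  for (const auto& batch : batches) writer.WriteBatch(batch);
  writer.Close();
  return writer.pages;
}

// A page splits a record if its first value continues a previous record,
// i.e. if its first repetition level is not 0.
bool EveryPageStartsOnRecordBoundary(
    const std::vector<std::vector<int16_t>>& pages) {
  for (const auto& page : pages) {
    if (!page.empty() && page.front() != 0) return false;
  }
  return true;
}
```

   With a limit of 3 values, the row-aligned batches `{0,1,1}`, `{0,1}`,
   `{0,1,1,1}` produce pages that all start with repetition level 0. Cutting
   the same levels as `{0,1,1,0}`, `{1,0,1,1,1}` produces a second page that
   starts in the middle of a record. In this model the guarantee comes from
   the caller aligning batches to rows, not from the size check itself.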


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
