mapleFU commented on PR #36286: URL: https://github.com/apache/arrow/pull/36286#issuecomment-1609882455
> Can you explain which memory saving you're talking about? Can you show an example?

When a user writes rows to a Parquet table row by row, we usually batch the rows, convert the batch to Arrow, and write the Arrow batch to Parquet; once the file is large enough, we close the writer and upload the file to OSS. Today `parquet::arrow::WriteTable` is what ultimately gets called, and each call finishes a RowGroup. As a result our files end up with hundreds or even thousands of RowGroups, which makes the file metadata grow huge.

This patch exports `parquet::arrow::FileWriter::WriteBatch`, which buffers the input into a single RowGroup. Compared to buffering the records directly in memory or in an Arrow file, buffering them in the Parquet writer might be more straightforward. @pitrou
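To make the distinction concrete, here is a minimal sketch (not part of this patch) of the per-batch write path described above. The `WriteBatches` helper is hypothetical, the exact `FileWriter::Open` overload varies across Arrow versions, and the final commented-out call just names the API this comment refers to:

```cpp
#include <memory>
#include <vector>

#include "arrow/api.h"
#include "arrow/io/api.h"
#include "parquet/arrow/writer.h"

// Hypothetical helper: write a stream of small record batches into one Parquet file.
arrow::Status WriteBatches(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
    const std::shared_ptr<arrow::io::OutputStream>& sink) {
  // The exact Open() overload differs between Arrow versions; this is the
  // Result-returning form with default writer properties.
  ARROW_ASSIGN_OR_RAISE(
      auto writer, parquet::arrow::FileWriter::Open(
                       *schema, arrow::default_memory_pool(), sink));

  for (const auto& batch : batches) {
    // Current behavior described above: each WriteTable call finishes the
    // RowGroup, so a file built from many small batches gets many RowGroups.
    ARROW_ASSIGN_OR_RAISE(auto table,
                          arrow::Table::FromRecordBatches({batch}));
    ARROW_RETURN_NOT_OK(writer->WriteTable(*table, batch->num_rows()));

    // With the API this PR exports (per the description above), the batch
    // would instead be buffered into the current RowGroup, e.g.:
    //   ARROW_RETURN_NOT_OK(writer->WriteBatch(*batch));
  }
  return writer->Close();
}
```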
