mapleFU commented on PR #36286: URL: https://github.com/apache/arrow/pull/36286#issuecomment-1609882455
> Can you explain which memory saving you're talking about? Can you show an example?

When a user writes rows to a Parquet table row by row, we usually batch the rows, convert the batch to Arrow, and write the Arrow batch to Parquet; once the file is large enough, we close the writer and upload the file to OSS. Today `parquet::arrow::WriteTable` is what ultimately gets called, and each call finishes a RowGroup. As a result our files end up with hundreds or even thousands of RowGroups, which makes the file metadata grow huge.

This patch exports `parquet::arrow::FileWriter::WriteBatch`, which buffers the input into a single RowGroup. Compared to buffering the records directly in memory or in an Arrow file, buffering them in the Parquet writer might be more straightforward. @pitrou
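To make the distinction concrete, here is a minimal sketch (not part of this patch) of the per-batch write path described above. The `WriteBatches` helper is hypothetical, the exact `FileWriter::Open` overload varies across Arrow versions, and the final commented-out call just names the API this comment refers to:

```cpp
#include <memory>
#include <vector>

#include "arrow/api.h"
#include "arrow/io/api.h"
#include "parquet/arrow/writer.h"

// Hypothetical helper: write a stream of small record batches into one Parquet file.
arrow::Status WriteBatches(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
    const std::shared_ptr<arrow::io::OutputStream>& sink) {
  // The exact Open() overload differs between Arrow versions; this is the
  // Result-returning form with default writer properties.
  ARROW_ASSIGN_OR_RAISE(
      auto writer, parquet::arrow::FileWriter::Open(
                       *schema, arrow::default_memory_pool(), sink));

  for (const auto& batch : batches) {
    // Current behavior described above: each WriteTable call finishes the
    // RowGroup, so a file built from many small batches gets many RowGroups.
    ARROW_ASSIGN_OR_RAISE(auto table,
                          arrow::Table::FromRecordBatches({batch}));
    ARROW_RETURN_NOT_OK(writer->WriteTable(*table, batch->num_rows()));

    // With the API this PR exports (per the description above), the batch
    // would instead be buffered into the current RowGroup, e.g.:
    //   ARROW_RETURN_NOT_OK(writer->WriteBatch(*batch));
  }
  return writer->Close();
}
```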
