Re: [I] Why no bulk Arrow→Parquet write API in Java? How to avoid row-by-row RecordConsumer + optimize? [arrow-java]

via GitHub Tue, 11 Nov 2025 08:17:39 -0800


Fenil-v commented on issue #907:
URL: https://github.com/apache/arrow-java/issues/907#issuecomment-3517677803


   > > ### Describe the usage question you have. Please include as many useful details as possible.
   > > I have ~20 KB objects that I need to write to Parquet efficiently from Java. In C++, C#, and Python there's a direct/bulk Arrow-Parquet write (e.g. WriteTable / write_table) that avoids row-by-row iteration, but in Java I only see row-by-row paths via RecordConsumer or internal/unstable column writers. Questions:
   > > 
   > > 1. Is there a supported bulk/columnar Arrow-Parquet write API in Java (e.g., VectorSchemaRoot → Parquet) that avoids row-by-row calls?
   > > 2. If not, why is Java limited to row-by-row writes today? Any roadmap for feature parity with C++/Python/C#?
   > > 3. For now, what's the recommended optimization path to write 20 KB objects at high throughput from Java (without JNI), or is JNI/Dataset the recommended route?
   > > 4. Any best practices (batch sizing, encodings, writer settings) to mitigate the row-by-row overhead?
   > > 
   > > ### Component(s)
   > > Java
   > 
   > [@pitrou](https://github.com/pitrou) any idea on this? Any help would be a lifesaver for me.
   
   @julienledem 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

