martin-traverse commented on issue #735: URL: https://github.com/apache/arrow-java/issues/735#issuecomment-2927647261
Hello. I have been working on full support for Avro read / write using the arrow-avro adapter. The intention is to provide a high-level API for reading / writing whole files block by block, with each block corresponding to one VectorSchemaRoot (VSR) and supporting VSR recycling. I'd like to finish the Avro work first, but could we do something similar for Parquet? It would be a very simple read / write adapter for whole files; querying / analytics workloads would still use the Dataset code path. There is of course a concern about maintenance overhead, but I do think this is a gap and a solution would be useful.

For example, in my project we use Arrow-Java to translate to / from a range of formats, so naturally we want to include Parquet in that. Previously we went to the parquet-java API directly and had to build our own internal representation from there - a fair amount of work that would be much better handled in something like Arrow. There is also the issue of Hadoop dependencies: I know the Parquet maintainers have been working to eliminate them, but there is still quite a bit of fiddling needed to avoid pulling those dependencies in. If we could package all that up into an Arrow adapter with a much smaller dependency tree, I think that would be valuable as well.

I'd be happy to sketch out some ideas for review once the Avro work is complete. The parquet-java project already has code for schema translation, which helps, and somewhere I've got some old code from the last time I looked at this which might save some time as well. Interested to know people's thoughts on this! In any case there are still one or two commits to go on Avro first.
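To make the block-by-block shape concrete, here is a rough, self-contained sketch of the kind of reader loop I have in mind. All names here (`BlockReader`, `next`, `readerOver`) are hypothetical, not the actual arrow-avro or proposed Parquet API, and a plain `List<String>` stands in for the recycled VectorSchemaRoot so the snippet compiles without Arrow on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockReadSketch {

    // Hypothetical adapter interface: each call to next() refills the
    // caller-supplied batch, so one buffer (a VSR in the real adapter)
    // can be recycled across all blocks of the file.
    interface BlockReader {
        boolean next(List<String> reusableBatch); // false once the file is exhausted
    }

    // A toy in-memory "file" of blocks, standing in for an Avro/Parquet file.
    static BlockReader readerOver(List<List<String>> fileBlocks) {
        return new BlockReader() {
            private int pos = 0;

            @Override
            public boolean next(List<String> reusableBatch) {
                if (pos >= fileBlocks.size()) {
                    return false;
                }
                reusableBatch.clear();                    // recycle in place
                reusableBatch.addAll(fileBlocks.get(pos++));
                return true;
            }
        };
    }

    public static void main(String[] args) {
        BlockReader reader = readerOver(List.of(List.of("a", "b"), List.of("c")));
        List<String> batch = new ArrayList<>();           // the one reusable "VSR"
        int blocks = 0;
        int rows = 0;
        while (reader.next(batch)) {                      // one iteration per block
            blocks++;
            rows += batch.size();
        }
        System.out.println(blocks + " blocks, " + rows + " rows");
    }
}
```

The point is the calling convention, not the types: the consumer owns one batch, the adapter fills it per block, and nothing is allocated per row.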
