martin-traverse commented on issue #735: URL: https://github.com/apache/arrow-java/issues/735#issuecomment-2927647261
Hello. I have been working on full support for Avro read / write using the arrow-avro adapter. The intention is to provide a high-level API for reading / writing whole files block by block, with each block corresponding to one VectorSchemaRoot (VSR) and supporting VSR recycling. I'd like to finish the Avro work first, but could we do something similar for Parquet? It would be a very simple read / write adapter for whole files; querying / analytics workloads would still use the Dataset code path. There is of course a concern about maintenance overhead, but I do think this is a gap and a solution would be useful.

For example, in my project we use Arrow-Java to translate to / from a range of formats, so naturally we want to include Parquet in that. Previously we went to the parquet-java API directly and had to build our own internal representation from there - a fair amount of work that would be much better handled in something like Arrow. There is also the issue of Hadoop dependencies: I know the Parquet maintainers have been working to eliminate them, but there is still quite a bit of fiddling needed to avoid pulling those dependencies in. If we could package all that up into an Arrow adapter with a much smaller dependency tree, I think that would be valuable as well.

I'd be happy to sketch out some ideas for review once the Avro work is complete. The parquet-java project already has code for schema translation, which helps, and somewhere I've got some old code from the last time I looked at this which might save some time as well. Interested to know people's thoughts on this! In any case there are still one or two commits to go on Avro first.
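To make the block-by-block shape concrete, here is a rough, self-contained sketch of the kind of reader loop I have in mind. All names here (`BlockReader`, `next`, `readerOver`) are hypothetical, not the actual arrow-avro or proposed Parquet API, and a plain `List<String>` stands in for the recycled VectorSchemaRoot so the snippet compiles without Arrow on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockReadSketch {

    // Hypothetical adapter interface: each call to next() refills the
    // caller-supplied batch, so one buffer (a VSR in the real adapter)
    // can be recycled across all blocks of the file.
    interface BlockReader {
        boolean next(List<String> reusableBatch); // false once the file is exhausted
    }

    // A toy in-memory "file" of blocks, standing in for an Avro/Parquet file.
    static BlockReader readerOver(List<List<String>> fileBlocks) {
        return new BlockReader() {
            private int pos = 0;

            @Override
            public boolean next(List<String> reusableBatch) {
                if (pos >= fileBlocks.size()) {
                    return false;
                }
                reusableBatch.clear();                    // recycle in place
                reusableBatch.addAll(fileBlocks.get(pos++));
                return true;
            }
        };
    }

    public static void main(String[] args) {
        BlockReader reader = readerOver(List.of(List.of("a", "b"), List.of("c")));
        List<String> batch = new ArrayList<>();           // the one reusable "VSR"
        int blocks = 0;
        int rows = 0;
        while (reader.next(batch)) {                      // one iteration per block
            blocks++;
            rows += batch.size();
        }
        System.out.println(blocks + " blocks, " + rows + " rows");
    }
}
```

The point is the calling convention, not the types: the consumer owns one batch, the adapter fills it per block, and nothing is allocated per row.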
