Great! In the meantime, I started a PoC directly on top of parquet-common to see if I can implement a BeamParquetReader and a BeamParquetWriter.
I might also propose some PRs. I will continue on that tomorrow.

Thanks again!

Regards
JB

On 02/14/2018 08:04 PM, Ryan Blue wrote:
> Additions to the builders are easy enough that we can get that in. There's
> a PR out there that needs to be fixed:
> https://github.com/apache/parquet-mr/pull/446
>
> I've asked the author for just the builder changes. If we don't hear back,
> we can add another PR but I'd like to give the author some time to update.
>
> rb
>
> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Ryan,
>>
>> Thanks for the update.
>>
>> Ideally for Beam, it would be great to have the AvroParquetReader and
>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
>> allow me to directly leverage Beam FileIO.
>>
>> Do you have a rough date for the Parquet release with that?
>>
>> Thanks
>> Regards
>> JB
>>
>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>> Jean-Baptiste,
>>>
>>> We're planning a release that will include the new OutputFile class,
>>> which I think you should be able to use. Is there anything you'd change
>>> to make this work more easily with Beam?
>>>
>>> rb
>>>
>>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré
>>> <j...@nanthrax.net> wrote:
>>>
>>> Hi guys,
>>>
>>> I'm working on the Apache Beam ParquetIO:
>>>
>>> https://github.com/apache/beam/pull/1851
>>>
>>> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>>>
>>> While I was able to implement the read part using AvroParquetReader on
>>> top of Beam FileIO, I'm struggling with the write part.
>>>
>>> I have to create a ParquetSink implementing FileIO.Sink. In particular,
>>> I have to implement the open(WritableByteChannel channel) method.
>>>
>>> It's not possible to use AvroParquetWriter here as it takes a Path as
>>> argument (and from the channel, I can only get an OutputStream).
>>>
>>> As a workaround, I wanted to use
>>> org.apache.parquet.hadoop.ParquetFileWriter, providing my own
>>> implementation of org.apache.parquet.io.OutputFile.
>>>
>>> Unfortunately, OutputFile (and the updated method in ParquetFileWriter)
>>> exists on the Parquet master branch, but it was different on Parquet 1.9.0.
>>>
>>> So, I have two questions:
>>> - do you plan a Parquet 1.9.1 release including
>>>   org.apache.parquet.io.OutputFile and the updated
>>>   org.apache.parquet.hadoop.ParquetFileWriter?
>>> - using Parquet 1.9.0, do you have any advice on how to use
>>>   AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
>>>   that I can get from a WritableByteChannel)?
>>>
>>> Thanks!
>>>
>>> Regards
>>> JB
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
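
For the write-side problem discussed in this thread (open(WritableByteChannel channel) only gives access to a channel or an OutputStream, while the Parquet writer needs to know how many bytes it has written so far), here is a minimal sketch using only the JDK. PositionTrackingStream and getPos() are hypothetical names chosen for illustration; they are not part of Parquet or Beam. The assumption is that a custom OutputFile implementation would hand a stream of this shape to the Parquet writer; the actual wiring depends on the Parquet release and is not shown.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper (not a Parquet or Beam class): wraps the channel
// received by a sink and counts the bytes written through it.
final class PositionTrackingStream extends OutputStream {
    private final OutputStream delegate;
    private long position = 0;

    PositionTrackingStream(WritableByteChannel channel) {
        // JDK adapter from a channel to a plain OutputStream.
        this.delegate = Channels.newOutputStream(channel);
    }

    // Number of bytes written through this stream so far.
    long getPos() {
        return position;
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
        position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        delegate.write(b, off, len);
        position += len;
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        // Note: this also closes the underlying channel.
        delegate.close();
    }
}
```

The position is tracked locally because a WritableByteChannel has no notion of an offset. Whether close() should propagate to the channel depends on who owns the channel in the calling code, so treat that part as an open design choice.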