FYI: Apache Parquet 1.10.0 was released recently. It contains org.apache.parquet.io.OutputFile and the updated org.apache.parquet.hadoop.ParquetFileWriter.
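
In case it helps with the FileIO.Sink side, here is a rough sketch of the stream wrapper that open(WritableByteChannel channel) would need. As far as I can tell, an OutputFile implementation has to hand back an org.apache.parquet.io.PositionOutputStream, which is an OutputStream that can also report how many bytes have been written via getPos(). The class name CountingChannelStream is made up for illustration, and to keep the snippet runnable without Parquet on the classpath it extends plain java.io.OutputStream; for real use it would presumably extend PositionOutputStream instead. Treat it as a sketch under those assumptions, not as what Beam or Parquet actually ship.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper: wraps a WritableByteChannel and counts the bytes written.
 * Assumption: for use with Parquet it would extend
 * org.apache.parquet.io.PositionOutputStream rather than OutputStream.
 */
class CountingChannelStream extends OutputStream {
    private final OutputStream delegate;
    private long position = 0;

    CountingChannelStream(WritableByteChannel channel) {
        // Channels.newOutputStream adapts the channel to a blocking stream.
        this.delegate = Channels.newOutputStream(channel);
    }

    /** Bytes written so far; plays the role of PositionOutputStream.getPos(). */
    public long getPos() {
        return position;
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
        position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        delegate.write(b, off, len);
        position += len;
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

class CountingChannelStreamDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        WritableByteChannel channel = Channels.newChannel(sink);
        try (CountingChannelStream out = new CountingChannelStream(channel)) {
            // Write the 4-byte Parquet magic as sample data.
            out.write("PAR1".getBytes(StandardCharsets.US_ASCII));
            System.out.println(out.getPos()); // prints 4
        }
    }
}
```

The position is tracked in the wrapper because a bare WritableByteChannel has no notion of the current offset, while the Parquet writer needs offsets to record where row groups and the footer start.
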
WBR,
Alexey

> On 14 Feb 2018, at 20:10, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
> Great !!
>
> In the mean time, I started to PoC around directly parquet-common to see if I
> can implement a BeamParquetReader and a BeamParquetWriter.
>
> I might also propose some PRs.
>
> I will continue tomorrow around that.
>
> Thanks again !
> Regards
> JB
>
> On 02/14/2018 08:04 PM, Ryan Blue wrote:
>> Additions to the builders are easy enough that we can get that in. There's
>> a PR out there that needs to be fixed:
>> https://github.com/apache/parquet-mr/pull/446
>>
>> I've asked the author for just the builder changes. If we don't hear back,
>> we can add another PR but I'd like to give the author some time to update.
>>
>> rb
>>
>> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> Thanks for the update.
>>>
>>> Ideally for Beam, it would be great to have the AvroParquetReader and
>>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
>>> allow me to directly leverage Beam FileIO.
>>>
>>> Do you have a rough date for the Parquet release with that ?
>>>
>>> Thanks
>>> Regards
>>> JB
>>>
>>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>>> Jean-Baptiste,
>>>>
>>>> We're planning a release that will include the new OutputFile class,
>>>> which I think you should be able to use. Is there anything you'd change
>>>> to make this work more easily with Beam?
>>>>
>>>> rb
>>>>
>>>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré
>>>> <j...@nanthrax.net> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I'm working on the Apache Beam ParquetIO:
>>>>
>>>> https://github.com/apache/beam/pull/1851
>>>>
>>>> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>>>>
>>>> If I was able to implement the Read part using AvroParquetReader
>>>> leveraging Beam FileIO, I'm struggling on the writing part.
>>>>
>>>> I have to create ParquetSink implementing FileIO.Sink. Especially, I
>>>> have to implement the open(WritableByteChannel channel) method.
>>>>
>>>> It's not possible to use AvroParquetWriter here as it takes a Path as
>>>> argument (and from the channel, I can only have an OutputStream).
>>>>
>>>> As a workaround, I wanted to use
>>>> org.apache.parquet.hadoop.ParquetFileWriter, providing my own
>>>> implementation of org.apache.parquet.io.OutputFile.
>>>>
>>>> Unfortunately OutputFile (and the updated method in ParquetFileWriter)
>>>> exists on Parquet master branch, but it was different on Parquet 1.9.0.
>>>>
>>>> So, I have two questions:
>>>> - do you plan a Parquet 1.9.1 release including
>>>>   org.apache.parquet.io.OutputFile and updated
>>>>   org.apache.parquet.hadoop.ParquetFileWriter ?
>>>> - using Parquet 1.9.0, do you have any advice how to use
>>>>   AvroParquetWriter/ParquetFileWriter with an OutputStream (or any
>>>>   object that I can get from WritableByteChannel) ?
>>>>
>>>> Thanks !
>>>>
>>>> Regards
>>>> JB
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbono...@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com