Jean-Baptiste,

We're planning a release that will include the new OutputFile class, which I think you should be able to use. Is there anything you'd change to make this work more easily with Beam?
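For reference, here is a rough sketch of what an OutputFile backed by a Beam WritableByteChannel might look like. The interface shown is what's currently on master and could still change before the release, and the class names here are just illustrative, not anything that exists in Beam or Parquet today:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.WritableByteChannel;

    import org.apache.parquet.io.OutputFile;
    import org.apache.parquet.io.PositionOutputStream;

    // Illustrative only: an OutputFile that writes to a Beam WritableByteChannel.
    class BeamOutputFile implements OutputFile {
      private final WritableByteChannel channel;

      BeamOutputFile(WritableByteChannel channel) {
        this.channel = channel;
      }

      @Override
      public PositionOutputStream create(long blockSizeHint) throws IOException {
        return new PositionOutputStreamAdapter(Channels.newOutputStream(channel));
      }

      @Override
      public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
        return create(blockSizeHint);
      }

      @Override
      public boolean supportsBlockSize() {
        return false;
      }

      @Override
      public long defaultBlockSize() {
        return 0;
      }

      // Counts bytes as they are written so getPos() works without a seekable file.
      private static class PositionOutputStreamAdapter extends PositionOutputStream {
        private final OutputStream out;
        private long position = 0;

        PositionOutputStreamAdapter(OutputStream out) {
          this.out = out;
        }

        @Override
        public long getPos() {
          return position;
        }

        @Override
        public void write(int b) throws IOException {
          out.write(b);
          position += 1;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
          out.write(b, off, len);
          position += len;
        }

        @Override
        public void flush() throws IOException {
          out.flush();
        }

        @Override
        public void close() throws IOException {
          out.close();
        }
      }
    }

With something along those lines, the sink's open(WritableByteChannel) could build the writer against an OutputFile instead of a Path once the release is out.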
rb

On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <[email protected]> wrote:
> Hi guys,
>
> I'm working on the Apache Beam ParquetIO:
>
> https://github.com/apache/beam/pull/1851
>
> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>
> While I was able to implement the read part using AvroParquetReader leveraging Beam FileIO, I'm struggling with the write part.
>
> I have to create a ParquetSink implementing FileIO.Sink; in particular, I have to implement the open(WritableByteChannel channel) method.
>
> It's not possible to use AvroParquetWriter here, as it takes a Path as argument (and from the channel I can only get an OutputStream).
>
> As a workaround, I wanted to use org.apache.parquet.hadoop.ParquetFileWriter, providing my own implementation of org.apache.parquet.io.OutputFile.
>
> Unfortunately, OutputFile (and the updated method in ParquetFileWriter) exists on the Parquet master branch, but the API was different in Parquet 1.9.0.
>
> So, I have two questions:
> - do you plan a Parquet 1.9.1 release including org.apache.parquet.io.OutputFile and the updated org.apache.parquet.hadoop.ParquetFileWriter?
> - using Parquet 1.9.0, do you have any advice on how to use AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object I can get from a WritableByteChannel)?
>
> Thanks!
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com

--
Ryan Blue
Software Engineer
Netflix
