Thanks for raising this, JB! To clarify for people on the Parquet mailing list who are not familiar with Beam:
Beam supports multiple filesystems (currently: local, HDFS, Google Cloud, S3) via a pluggable interface that, among other things, can give you a Channel for reading/writing a given path, and we'd like to be able to read and write Parquet files on any of the supported filesystems. The current AvroParquetReader/Writer API, which takes a Path to the file, supports only local and HDFS files.

We would like to be able to read Parquet files via a ReadableByteChannel or InputStream, and write via a WritableByteChannel or OutputStream. (JB raised the issue for writing, but I just realized it affects reading to the same extent.)

The ParquetFileWriter constructor taking an OutputFile seems to help with that; likewise, ParquetFileReader with InputFile. (A rough sketch of such an OutputFile adapter is at the bottom of this mail.) Generally this seems to be covered by https://issues.apache.org/jira/browse/PARQUET-1142 (marked "fixed" for 1.10 but I wasn't ) and https://issues.apache.org/jira/browse/PARQUET-1126 .

On Tue, Feb 13, 2018 at 12:31 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi guys,
>
> I'm working on the Apache Beam ParquetIO:
>
> https://github.com/apache/beam/pull/1851
>
> In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>
> While I was able to implement the Read part using AvroParquetReader
> leveraging Beam FileIO, I'm struggling on the writing part.
>
> I have to create a ParquetSink implementing FileIO.Sink. Especially, I have
> to implement the open(WritableByteChannel channel) method.
>
> It's not possible to use AvroParquetWriter here as it takes a Path as
> argument (and from the channel, I can only get an OutputStream).
>
> As a workaround, I wanted to use
> org.apache.parquet.hadoop.ParquetFileWriter,
> providing my own implementation of org.apache.parquet.io.OutputFile.
>
> Unfortunately OutputFile (and the updated method in ParquetFileWriter)
> exists on the Parquet master branch, but it was different in Parquet 1.9.0.
>
> So, I have two questions:
> - do you plan a Parquet 1.9.1 release including
> org.apache.parquet.io.OutputFile and the updated
> org.apache.parquet.hadoop.ParquetFileWriter?
> - using Parquet 1.9.0, do you have any advice on how to use
> AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
> that I can get from a WritableByteChannel)?
>
> Thanks!
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
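
For illustration, here is a rough sketch of what such an adapter could look like on the Beam side, based on my reading of the OutputFile / PositionOutputStream types on the Parquet master branch. The class names here (BeamParquetOutputFile, CountingPositionOutputStream) are made up for the example, and the exact set of OutputFile methods may not match what ends up in 1.10:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

/** Sketch: adapts a Beam WritableByteChannel to Parquet's OutputFile. */
class BeamParquetOutputFile implements OutputFile {
  private final WritableByteChannel channel;

  BeamParquetOutputFile(WritableByteChannel channel) {
    this.channel = channel;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    return new CountingPositionOutputStream(Channels.newOutputStream(channel));
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return create(blockSizeHint);
  }

  @Override
  public boolean supportsBlockSize() {
    return false; // an arbitrary channel has no HDFS-style block size
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }

  /** Wraps an OutputStream and counts bytes written, since Parquet needs getPos(). */
  private static class CountingPositionOutputStream extends PositionOutputStream {
    private final OutputStream out;
    private long position = 0;

    CountingPositionOutputStream(OutputStream out) {
      this.out = out;
    }

    @Override
    public long getPos() {
      return position;
    }

    @Override
    public void write(int b) throws IOException {
      out.write(b);
      position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
      out.write(b, off, len);
      position += len;
    }

    @Override
    public void flush() throws IOException {
      out.flush();
    }

    @Override
    public void close() throws IOException {
      out.close();
    }
  }
}

The position is tracked by counting bytes on the Beam side because a WritableByteChannel/OutputStream doesn't expose one, and (as far as I understand) Parquet relies on getPos() to record column chunk and footer offsets while writing. Nothing here depends on a Hadoop Path, which is the point of the exercise.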