I'm working on ParquetIO for Apache Beam.
In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
While I was able to implement the read part using AvroParquetReader on top of Beam's
FileIO, I'm struggling with the write part.
I have to create a ParquetSink that implements FileIO.Sink. In particular, I have to
implement the open(WritableByteChannel channel) method.
AvroParquetWriter can't be used here because it takes a Path as its argument,
and from the channel I can only get an OutputStream.
As a workaround, I wanted to use org.apache.parquet.hadoop.ParquetFileWriter,
providing my own implementation of org.apache.parquet.io.OutputFile.
Unfortunately, OutputFile (and the updated method in ParquetFileWriter) exist on the
Parquet master branch but not in Parquet 1.9.0, where the API is different.
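For context, the piece I would need on top of the channel is small. Here is a minimal sketch using only java.nio. The class name ChannelPositionOutputStream is hypothetical and not part of Beam or Parquet. As far as I can tell, the output stream type used with OutputFile on master also has to report its current position through getPos(), which is why the sketch counts the bytes it writes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

/**
 * Hypothetical helper (not part of Beam or Parquet): an OutputStream over a
 * WritableByteChannel that also tracks how many bytes have been written.
 */
final class ChannelPositionOutputStream extends OutputStream {
  private final WritableByteChannel channel;
  private long position = 0;

  ChannelPositionOutputStream(WritableByteChannel channel) {
    this.channel = channel;
  }

  /** Number of bytes written so far, i.e. the current offset in the output. */
  long getPos() {
    return position;
  }

  @Override
  public void write(int b) throws IOException {
    // Delegate single-byte writes to the array variant.
    write(new byte[] {(byte) b}, 0, 1);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    ByteBuffer buffer = ByteBuffer.wrap(b, off, len);
    // A channel may accept fewer bytes than offered, so loop until drained.
    while (buffer.hasRemaining()) {
      position += channel.write(buffer);
    }
  }

  @Override
  public void close() throws IOException {
    // Assumption: this sketch closes the channel itself; whether the sink or
    // the caller should own the channel is a separate design decision.
    channel.close();
  }
}
```

With something like this, open(WritableByteChannel channel) could wrap the channel once and hand the stream to an OutputFile implementation, provided that interface is available in the Parquet version in use.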
So, I have two questions:
- do you plan a Parquet 1.9.1 release that includes org.apache.parquet.io.OutputFile
and the updated org.apache.parquet.hadoop.ParquetFileWriter?
- with Parquet 1.9.0, do you have any advice on how to use
AvroParquetWriter/ParquetFileWriter with an OutputStream (or any other object that I
can get from a WritableByteChannel)?
Talend - http://www.talend.com