Great !!

In the meantime, I started a PoC directly on top of parquet-common to see if I
can implement a BeamParquetReader and a BeamParquetWriter.
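
For illustration only (this is not the PoC code itself): the OutputFile
interface on Parquet master returns a stream that must be able to report its
current position, while a WritableByteChannel offers no such notion. A minimal
sketch of that one piece, using only JDK classes, could look like the
following. The class name PositionTrackingStream is hypothetical, and wiring
it into Parquet's OutputFile is deliberately left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

/**
 * Hypothetical helper: an OutputStream over a WritableByteChannel that
 * counts the bytes written, so the current write position is always known.
 */
class PositionTrackingStream extends OutputStream {
    private final OutputStream out;
    private long position = 0;

    PositionTrackingStream(WritableByteChannel channel) {
        // Channels.newOutputStream adapts the channel to a plain stream.
        this.out = Channels.newOutputStream(channel);
    }

    /** Number of bytes written so far, i.e. the offset of the next byte. */
    long getPos() {
        return position;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        position += len;
    }

    @Override
    public void flush() throws IOException {
        out.flush();
    }

    @Override
    public void close() throws IOException {
        // Note: this also closes the underlying channel.
        out.close();
    }
}
```

An OutputFile implementation could then hand out such a stream from its
create methods, assuming the channel is the one received in
FileIO.Sink.open(WritableByteChannel).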

I might also propose some PRs.

I will continue working on that tomorrow.

Thanks again !
Regards
JB

On 02/14/2018 08:04 PM, Ryan Blue wrote:
> Additions to the builders are easy enough that we can get that in. There's
> a PR out there that needs to be fixed:
> https://github.com/apache/parquet-mr/pull/446
> 
> I've asked the author for just the builder changes. If we don't hear back,
> we can add another PR but I'd like to give the author some time to update.
> 
> rb
> 
> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> 
>> Hi Ryan,
>>
>> Thanks for the update.
>>
>> Ideally for Beam, it would be great to have the AvroParquetReader and
>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would allow me
>> to directly leverage Beam FileIO.
>>
>> Do you have a rough date for the Parquet release with that ?
>>
>> Thanks
>> Regards
>> JB
>>
>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>> Jean-Baptiste,
>>>
>>> We're planning a release that will include the new OutputFile class, which I
>>> think you should be able to use. Is there anything you'd change to make this
>>> work more easily with Beam?
>>>
>>> rb
>>>
>>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>
>>>     Hi guys,
>>>
>>>     I'm working on the Apache Beam ParquetIO:
>>>
>>>     https://github.com/apache/beam/pull/1851
>>>
>>>     In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>>>
>>>     While I was able to implement the read part using AvroParquetReader on top
>>>     of Beam FileIO, I'm struggling with the write part.
>>>
>>>     I have to create a ParquetSink implementing FileIO.Sink. In particular, I
>>>     have to implement the open(WritableByteChannel channel) method.
>>>
>>>     It's not possible to use AvroParquetWriter here as it takes a Path as
>>>     argument (and from the channel, I can only have an OutputStream).
>>>
>>>     As a workaround, I wanted to use
>>>     org.apache.parquet.hadoop.ParquetFileWriter, providing my own
>>>     implementation of org.apache.parquet.io.OutputFile.
>>>
>>>     Unfortunately, OutputFile (and the updated method in ParquetFileWriter)
>>>     exists on the Parquet master branch, but it was different in Parquet 1.9.0.
>>>
>>>     So, I have two questions:
>>>     - do you plan a Parquet 1.9.1 release including
>>>     org.apache.parquet.io.OutputFile and the updated
>>>     org.apache.parquet.hadoop.ParquetFileWriter ?
>>>     - using Parquet 1.9.0, do you have any advice on how to use
>>>     AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
>>>     that I can get from a WritableByteChannel) ?
>>>
>>>     Thanks !
>>>
>>>     Regards
>>>     JB
>>>     --
>>>     Jean-Baptiste Onofré
>>>     jbono...@apache.org
>>>     http://blog.nanthrax.net
>>>     Talend - http://www.talend.com
>>>
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
