FYI: Apache Parquet 1.10.0 was released recently.
It contains org.apache.parquet.io.OutputFile and the updated
org.apache.parquet.hadoop.ParquetFileWriter.

WBR,
Alexey
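
A minimal sketch of the piece discussed in the thread below: getting from the
WritableByteChannel that Beam's FileIO.Sink hands over to a stream that can
report its write position. As far as I know, Parquet's OutputFile expects a
stream that exposes its current offset, so this wrapper counts the bytes it
writes. The names ChannelPositionDemo and PositionTrackingStream are
hypothetical, and the code uses only the JDK. Wiring it into
org.apache.parquet.io.OutputFile is left out here and would need to be checked
against the 1.10.0 API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class ChannelPositionDemo {

    /** Hypothetical helper: an OutputStream over a channel that counts bytes written. */
    static final class PositionTrackingStream extends OutputStream {
        private final OutputStream out;
        private long pos = 0;

        PositionTrackingStream(WritableByteChannel channel) {
            // Standard JDK adapter from a channel to a blocking OutputStream.
            this.out = Channels.newOutputStream(channel);
        }

        /** Number of bytes written so far, i.e. the current offset in the output. */
        long getPos() {
            return pos;
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            pos++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            pos += len;
        }

        @Override
        public void flush() throws IOException {
            out.flush();
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the channel that a sink's open(WritableByteChannel) would receive.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        WritableByteChannel channel = Channels.newChannel(sink);
        try (PositionTrackingStream stream = new PositionTrackingStream(channel)) {
            stream.write("PAR1".getBytes(StandardCharsets.US_ASCII));
            System.out.println(stream.getPos()); // prints 4
        }
    }
}
```

The position is tracked locally because a plain WritableByteChannel has no way
to report an offset.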

> On 14 Feb 2018, at 20:10, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> 
> Great !!
> 
> In the meantime, I started a PoC directly on top of parquet-common to see if I
> can implement a BeamParquetReader and a BeamParquetWriter.
> 
> I might also propose some PRs.
> 
> I will continue working on that tomorrow.
> 
> Thanks again !
> Regards
> JB
> 
> On 02/14/2018 08:04 PM, Ryan Blue wrote:
>> Additions to the builders are easy enough that we can get that in. There's
>> a PR out there that needs to be fixed:
>> https://github.com/apache/parquet-mr/pull/446
>> 
>> I've asked the author for just the builder changes. If we don't hear back,
>> we can add another PR but I'd like to give the author some time to update.
>> 
>> rb
>> 
>> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>> 
>>> Hi Ryan,
>>> 
>>> Thanks for the update.
>>> 
>>> Ideally for Beam, it would be great to have the AvroParquetReader and
>>> AvroParquetWriter using the InputFile/OutputFile interfaces. It would allow me
>>> to directly leverage Beam FileIO.
>>> 
>>> Do you have a rough date for the Parquet release with that ?
>>> 
>>> Thanks
>>> Regards
>>> JB
>>> 
>>> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>>>> Jean-Baptiste,
>>>> 
>>>> We're planning a release that will include the new OutputFile class, which I
>>>> think you should be able to use. Is there anything you'd change to make this
>>>> work more easily with Beam?
>>>> 
>>>> rb
>>>> 
>>>> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>> 
>>>>    Hi guys,
>>>> 
>>>>    I'm working on the Apache Beam ParquetIO:
>>>> 
>>>>    https://github.com/apache/beam/pull/1851
>>>> 
>>>>    In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>>>> 
>>>>    While I was able to implement the read part using AvroParquetReader on top
>>>>    of Beam FileIO, I'm struggling with the write part.
>>>> 
>>>>    I have to create a ParquetSink implementing FileIO.Sink. In particular, I
>>>>    have to implement the open(WritableByteChannel channel) method.
>>>> 
>>>>    It's not possible to use AvroParquetWriter here as it takes a Path as an
>>>>    argument (and from the channel, I can only get an OutputStream).
>>>> 
>>>>    As a workaround, I wanted to use
>>>>    org.apache.parquet.hadoop.ParquetFileWriter, providing my own
>>>>    implementation of org.apache.parquet.io.OutputFile.
>>>> 
>>>>    Unfortunately, OutputFile (and the updated method in ParquetFileWriter)
>>>>    exists on the Parquet master branch, but things were different in
>>>>    Parquet 1.9.0.
>>>> 
>>>>    So, I have two questions:
>>>>    - do you plan a Parquet 1.9.1 release including
>>>>    org.apache.parquet.io.OutputFile and the updated
>>>>    org.apache.parquet.hadoop.ParquetFileWriter?
>>>>    - using Parquet 1.9.0, do you have any advice on how to use
>>>>    AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
>>>>    that I can get from a WritableByteChannel)?
>>>> 
>>>>    Thanks !
>>>> 
>>>>    Regards
>>>>    JB
>>>>    --
>>>>    Jean-Baptiste Onofré
>>>>    jbono...@apache.org
>>>>    http://blog.nanthrax.net
>>>>    Talend - http://www.talend.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>> 
>> 
>> 
>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
