Re: [PROPOSAL] ParquetIO support for Python SDK

2018-11-13 Thread Heejong Lee
In the current PR, two parameters control the final row group size:
row_group_buffer_size and record_batch_size. Records are first stored as a
list of columns and then converted into a record batch (a data structure
defined in pyarrow) when the number of records in the list reaches
record_batch_size. Record batches are collected in another list, which is
written as a single row group when its size in bytes exceeds
row_group_buffer_size. row_group_buffer_size is normally much bigger than
the size of the resulting row group in the Parquet file, so it's not an
exact estimate of the row group size written to the file, but I guess this
is the best we can do given the limitations of the Python Parquet libraries.
For a better estimate of the row group size in bytes, the Parquet library
would need to provide buffered writing of a row group and a method returning
the size of the encoded data in the write buffer. No currently available
Python Parquet library implements these features.
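
The two-level buffering described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the code in the
PR: the class RowGroupBuffer, its methods, and the estimate_bytes helper are
hypothetical names, plain dicts of lists stand in for pyarrow record batches,
and a Python list stands in for the output file.

```python
class RowGroupBuffer:
    """Hypothetical sketch: records -> record batches -> row groups."""

    def __init__(self, column_names, record_batch_size,
                 row_group_buffer_size, estimate_bytes):
        self.column_names = list(column_names)
        self.record_batch_size = record_batch_size
        self.row_group_buffer_size = row_group_buffer_size
        self.estimate_bytes = estimate_bytes  # assumed in-memory size estimator
        self.columns = [[] for _ in self.column_names]  # pending records, by column
        self.num_pending = 0     # number of records in self.columns
        self.batches = []        # finished record batches, not yet written
        self.batches_bytes = 0   # estimated in-memory size of self.batches
        self.row_groups = []     # stand-in for row groups written to a file

    def write(self, record):
        # Store the record column-wise.
        for values, name in zip(self.columns, self.column_names):
            values.append(record[name])
        self.num_pending += 1
        # Level 1: enough records collected, so form a record batch.
        if self.num_pending >= self.record_batch_size:
            self._flush_columns()
        # Level 2: buffered batches exceed the byte limit, so write a row group.
        if self.batches_bytes > self.row_group_buffer_size:
            self._flush_row_group()

    def _flush_columns(self):
        if self.num_pending == 0:
            return
        batch = dict(zip(self.column_names, self.columns))
        self.batches.append(batch)
        self.batches_bytes += self.estimate_bytes(batch)
        self.columns = [[] for _ in self.column_names]
        self.num_pending = 0

    def _flush_row_group(self):
        if not self.batches:
            return
        self.row_groups.append(self.batches)  # one row group per flush
        self.batches = []
        self.batches_bytes = 0

    def close(self):
        # Write out whatever is still buffered.
        self._flush_columns()
        self._flush_row_group()


def estimate_bytes(batch):
    # Crude assumption for the sketch: every value occupies 8 bytes in memory.
    return 8 * sum(len(values) for values in batch.values())


buf = RowGroupBuffer(["id", "value"], record_batch_size=2,
                     row_group_buffer_size=64, estimate_bytes=estimate_bytes)
for i in range(7):
    buf.write({"id": i, "value": i * 10})
buf.close()
```

Note that the byte limit is checked against the in-memory estimate, not the
encoded size, which is why the row group that ends up in the file can be much
smaller than row_group_buffer_size.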


On Tue, Nov 13, 2018 at 4:44 AM Robert Bradshaw  wrote:

> Was there a resolution on how to handle row group size, given that it's
> hard to pick a decent default? IIRC, the ideal was to base this on
> byte sizes; will this be in v1 or will there be other parameter(s)
> that we'll have to support going forward?
> On Tue, Oct 30, 2018 at 10:42 PM Heejong Lee  wrote:
> >
> > Thanks all for the valuable feedback on the document. Here's the summary
> of planned features for ParquetIO Python SDK:
> >
> > Can read from Parquet file on any storage system supported by Beam
> >
> > Can write to Parquet file on any storage system supported by Beam
> >
> > Can configure the compression algorithm of output files
> >
> > Can adjust the size of the row group
> >
> > Can read multiple row groups in a single file in parallel (source
> splitting)
> >
> > Can partially read by columns
> >
> >
> > It introduces a new dependency, pyarrow, for parquet reading and writing
> operations.
> >
> > If you're interested, you can review and test the PR
> https://github.com/apache/beam/pull/6763
> >
> > Thanks,
> >
> > On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath 
> wrote:
> >>
> >> Thanks Heejong. Added some comments. +1 for summarizing the doc in the
> email thread.
> >>
> >> - Cham
> >>
> >> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay  wrote:
> >>>
> >>> Thank you Heejong. Could you also share a summary of the design
> document (major points/decisions) in the mailing list?
> >>>
> >>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee 
> wrote:
> 
> >>>> Hi,
> >>>>
> >>>> I'm working on BEAM-: Parquet IO for Python SDK.
> >>>>
> >>>> Issue: https://issues.apache.org/jira/browse/BEAM-
> >>>> Design doc:
> >>>> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
> >>>> WIP PR: https://github.com/apache/beam/pull/6763
> >>>>
> >>>> Any feedback is appreciated. Thanks!
> 
> >>>
>


Re: [PROPOSAL] ParquetIO support for Python SDK

2018-11-13 Thread Robert Bradshaw
Was there a resolution on how to handle row group size, given that it's
hard to pick a decent default? IIRC, the ideal was to base this on
byte sizes; will this be in v1 or will there be other parameter(s)
that we'll have to support going forward?
On Tue, Oct 30, 2018 at 10:42 PM Heejong Lee  wrote:
>
> Thanks all for the valuable feedback on the document. Here's the summary of 
> planned features for ParquetIO Python SDK:
>
> Can read from Parquet file on any storage system supported by Beam
>
> Can write to Parquet file on any storage system supported by Beam
>
> Can configure the compression algorithm of output files
>
> Can adjust the size of the row group
>
> Can read multiple row groups in a single file in parallel (source splitting)
>
> Can partially read by columns
>
>
> It introduces a new dependency, pyarrow, for parquet reading and writing
> operations.
>
> If you're interested, you can review and test the PR 
> https://github.com/apache/beam/pull/6763
>
> Thanks,
>
> On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath  
> wrote:
>>
>> Thanks Heejong. Added some comments. +1 for summarizing the doc in the email 
>> thread.
>>
>> - Cham
>>
>> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay  wrote:
>>>
>>> Thank you Heejong. Could you also share a summary of the design document 
>>> (major points/decisions) in the mailing list?
>>>
>>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee  wrote:

>>>> Hi,
>>>>
>>>> I'm working on BEAM-: Parquet IO for Python SDK.
>>>>
>>>> Issue: https://issues.apache.org/jira/browse/BEAM-
>>>> Design doc:
>>>> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>>
>>>> Any feedback is appreciated. Thanks!

>>>


Re: [PROPOSAL] ParquetIO support for Python SDK

2018-10-30 Thread Heejong Lee
Thanks all for the valuable feedback on the document. Here's the summary of
planned features for ParquetIO Python SDK:

   - Can read from Parquet files on any storage system supported by Beam
   - Can write to Parquet files on any storage system supported by Beam
   - Can configure the compression algorithm of output files
   - Can adjust the size of the row group
   - Can read multiple row groups in a single file in parallel (source
     splitting)
   - Can read a subset of columns (partial reads)


It introduces a new dependency, pyarrow, for Parquet reading and writing
operations.
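
To make the source splitting and column selection items in the list above
more concrete, here is a small sketch. It is an assumption-laden toy, not the
PR's implementation: split_row_groups and read_split are hypothetical helper
names, and a list of dicts (column name to list of values) stands in for the
row groups of a Parquet file instead of pyarrow.

```python
def split_row_groups(num_row_groups, num_splits):
    """Partition row group indices into contiguous ranges, one per split."""
    num_splits = max(1, min(num_splits, num_row_groups))
    base, extra = divmod(num_row_groups, num_splits)
    ranges, start = [], 0
    for i in range(num_splits):
        # The first `extra` splits take one additional row group each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges


def read_split(row_groups, indices, columns=None):
    """Yield records from the given row groups, keeping only `columns`."""
    for index in indices:
        group = row_groups[index]
        names = list(group) if columns is None else list(columns)
        num_rows = len(group[names[0]]) if names else 0
        for row in range(num_rows):
            yield {name: group[name][row] for name in names}


# Toy "file" with three row groups (stand-in data, not real Parquet).
file_row_groups = [
    {"id": [0, 1], "name": ["a", "b"], "score": [1.0, 2.0]},
    {"id": [2], "name": ["c"], "score": [3.0]},
    {"id": [3, 4], "name": ["d", "e"], "score": [4.0, 5.0]},
]

splits = split_row_groups(len(file_row_groups), 2)
# Each split could be handed to a different worker and read independently.
first = list(read_split(file_row_groups, splits[0], columns=["id", "score"]))
second = list(read_split(file_row_groups, splits[1], columns=["id", "score"]))
```

Because a row group is self-contained, each split can be read without looking
at the others, which is what makes reading them in parallel possible.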

If you're interested, you can review and test the PR
https://github.com/apache/beam/pull/6763

Thanks,

On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath 
wrote:

> Thanks Heejong. Added some comments. +1 for summarizing the doc in the
> email thread.
>
> - Cham
>
> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay  wrote:
>
>> Thank you Heejong. Could you also share a summary of the design document
>> (major points/decisions) in the mailing list?
>>
>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee  wrote:
>>
>>> Hi,
>>>
>>> I'm working on BEAM-: Parquet IO for Python SDK.
>>>
>>> Issue: https://issues.apache.org/jira/browse/BEAM-
>>> Design doc:
>>> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>
>>> Any feedback is appreciated. Thanks!
>>>
>>>
>>


Re: [PROPOSAL] ParquetIO support for Python SDK

2018-10-24 Thread Chamikara Jayalath
Thanks Heejong. Added some comments. +1 for summarizing the doc in the
email thread.

- Cham

On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay  wrote:

> Thank you Heejong. Could you also share a summary of the design document
> (major points/decisions) in the mailing list?
>
> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee  wrote:
>
>> Hi,
>>
>> I'm working on BEAM-: Parquet IO for Python SDK.
>>
>> Issue: https://issues.apache.org/jira/browse/BEAM-
>> Design doc:
>> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>> WIP PR: https://github.com/apache/beam/pull/6763
>>
>> Any feedback is appreciated. Thanks!
>>
>>
>


Re: [PROPOSAL] ParquetIO support for Python SDK

2018-10-24 Thread Ahmet Altay
Thank you Heejong. Could you also share a summary of the design document
(major points/decisions) in the mailing list?

On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee  wrote:

> Hi,
>
> I'm working on BEAM-: Parquet IO for Python SDK.
>
> Issue: https://issues.apache.org/jira/browse/BEAM-
> Design doc:
> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
> WIP PR: https://github.com/apache/beam/pull/6763
>
> Any feedback is appreciated. Thanks!
>
>


[PROPOSAL] ParquetIO support for Python SDK

2018-10-24 Thread Heejong Lee
Hi,

I'm working on BEAM-: Parquet IO for Python SDK.

Issue: https://issues.apache.org/jira/browse/BEAM-
Design doc:
https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
WIP PR: https://github.com/apache/beam/pull/6763

Any feedback is appreciated. Thanks!