Thanks, all, for the valuable feedback on the document. Here's a summary of the planned features for ParquetIO in the Python SDK:
- Can read Parquet files from any storage system supported by Beam
- Can write Parquet files to any storage system supported by Beam
- Can configure the compression algorithm of output files
- Can adjust the size of the row group
- Can read multiple row groups in a single file in parallel (source splitting)
- Can read only a subset of columns

It introduces a new dependency, pyarrow, for Parquet reading and writing. If you're interested, please review and test the PR: https://github.com/apache/beam/pull/6763

A rough usage sketch is appended at the end of this message, below the quoted thread.

Thanks,

On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath <[email protected]> wrote:

> Thanks Heejong. Added some comments. +1 for summarizing the doc in the
> email thread.
>
> - Cham
>
> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay <[email protected]> wrote:
>
>> Thank you Heejong. Could you also share a summary of the design document
>> (major points/decisions) in the mailing list?
>>
>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm working on BEAM-4444: Parquet IO for Python SDK.
>>>
>>> Issue: https://issues.apache.org/jira/browse/BEAM-4444
>>> Design doc:
>>> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>
>>> Any feedback is appreciated. Thanks!
>>>
>>
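
Usage sketch (not the final API): a minimal example of how a pipeline might use the proposed transforms. The transform and parameter names (ReadFromParquet, WriteToParquet, columns, codec, row_group_buffer_size) follow the WIP PR and may still change before merging; the GCS paths and the name/score columns are placeholders for illustration only.

    # Rough sketch only; transform/parameter names are as proposed in the
    # WIP PR and may change. Paths and column names are placeholders.
    import pyarrow

    import apache_beam as beam
    from apache_beam.io.parquetio import ReadFromParquet, WriteToParquet

    with beam.Pipeline() as p:
        records = (
            p
            # Read only the listed columns; the source can split across
            # row groups within a file for parallel reads.
            | 'Read' >> ReadFromParquet('gs://my-bucket/input/*.parquet',
                                        columns=['name', 'score'])
            # Each element is a dict keyed by column name.
            | 'Filter' >> beam.Filter(lambda record: record['score'] > 0)
        )

        _ = records | 'Write' >> WriteToParquet(
            'gs://my-bucket/output/result',
            # The output schema is given as a pyarrow schema.
            pyarrow.schema([('name', pyarrow.string()),
                            ('score', pyarrow.int64())]),
            codec='snappy',                           # output compression
            row_group_buffer_size=64 * 1024 * 1024)   # approximate row group size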
