Thanks, all, for the valuable feedback on the document. Here's a summary of the planned features for ParquetIO in the Python SDK:
- Can read Parquet files from any storage system supported by Beam
- Can write Parquet files to any storage system supported by Beam
- Can configure the compression algorithm of output files
- Can adjust the size of the row group
- Can read multiple row groups in a single file in parallel (source splitting)
- Can read only a subset of columns

It introduces a new dependency, pyarrow, for Parquet reading and writing. If you're interested, please review and test the PR: https://github.com/apache/beam/pull/6763

A rough usage sketch is appended at the end of this message, below the quoted thread.

Thanks,

On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath <[email protected]> wrote:

> Thanks Heejong. Added some comments. +1 for summarizing the doc in the
> email thread.
>
> - Cham
>
> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay <[email protected]> wrote:
>
>> Thank you Heejong. Could you also share a summary of the design document
>> (major points/decisions) in the mailing list?
>>
>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm working on BEAM-4444: Parquet IO for Python SDK.
>>>
>>> Issue: https://issues.apache.org/jira/browse/BEAM-4444
>>> Design doc:
>>> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>
>>> Any feedback is appreciated. Thanks!
>>>
>>
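
Usage sketch (not the final API): a minimal example of how a pipeline might use the proposed transforms. The transform and parameter names (ReadFromParquet, WriteToParquet, columns, codec, row_group_buffer_size) follow the WIP PR and may still change before merging; the GCS paths and the name/score columns are placeholders for illustration only.

    # Rough sketch only; transform/parameter names are as proposed in the
    # WIP PR and may change. Paths and column names are placeholders.
    import pyarrow

    import apache_beam as beam
    from apache_beam.io.parquetio import ReadFromParquet, WriteToParquet

    with beam.Pipeline() as p:
        records = (
            p
            # Read only the listed columns; the source can split across
            # row groups within a file for parallel reads.
            | 'Read' >> ReadFromParquet('gs://my-bucket/input/*.parquet',
                                        columns=['name', 'score'])
            # Each element is a dict keyed by column name.
            | 'Filter' >> beam.Filter(lambda record: record['score'] > 0)
        )

        _ = records | 'Write' >> WriteToParquet(
            'gs://my-bucket/output/result',
            # The output schema is given as a pyarrow schema.
            pyarrow.schema([('name', pyarrow.string()),
                            ('score', pyarrow.int64())]),
            codec='snappy',                           # output compression
            row_group_buffer_size=64 * 1024 * 1024)   # approximate row group size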
