Bashir Sadjad created BEAM-11969:
------------------------------------
Summary: Make row-group size configurable in ParquetIO.Sink
Key: BEAM-11969
URL: https://issues.apache.org/jira/browse/BEAM-11969
Project: Beam
Issue Type: Improvement
Components: io-java-parquet
Reporter: Bashir Sadjad
ParquetIO.Sink does not appear to have an option for setting the row-group size.
Its builder has a
[withConfiguration|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1089]
method, but calling it does not seem to change rowGroupSize in
[ParquetWriter.Builder|https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636]
so the default of 128 MB is used. It should be fairly easy to add the
plumbing for setting this option
[here|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1112].
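
As a rough illustration of the plumbing being requested, here is a minimal, self-contained sketch. It deliberately does not use the real Beam or Parquet classes: WriterBuilder and Sink below are hypothetical stand-ins for ParquetWriter.Builder and ParquetIO.Sink, and the method name withRowGroupSize on the sink is an assumed name for the new option, not an existing Beam API. The idea is only that the sink stores an optional row-group size and forwards it to the writer builder when one was set, otherwise leaving the 128 MB default in place.

```java
import java.util.OptionalInt;

public class RowGroupSizePlumbing {
    /** Default row-group size of 128 MB, as cited in the description. */
    public static final int DEFAULT_ROW_GROUP_SIZE = 128 * 1024 * 1024;

    /** Hypothetical stand-in for ParquetWriter.Builder; not the real class. */
    public static final class WriterBuilder {
        private int rowGroupSize = DEFAULT_ROW_GROUP_SIZE;

        public WriterBuilder withRowGroupSize(int bytes) {
            this.rowGroupSize = bytes;
            return this;
        }

        public int rowGroupSize() {
            return rowGroupSize;
        }
    }

    /** Hypothetical stand-in for ParquetIO.Sink; kept immutable in this sketch. */
    public static final class Sink {
        private final OptionalInt rowGroupSize;

        private Sink(OptionalInt rowGroupSize) {
            this.rowGroupSize = rowGroupSize;
        }

        public static Sink create() {
            return new Sink(OptionalInt.empty());
        }

        /** Assumed new option: returns a copy of the sink with the size set. */
        public Sink withRowGroupSize(int bytes) {
            if (bytes <= 0) {
                throw new IllegalArgumentException("rowGroupSize must be positive");
            }
            return new Sink(OptionalInt.of(bytes));
        }

        /** Where the sink creates its writer: forward the size only if it was set. */
        public WriterBuilder newWriterBuilder() {
            WriterBuilder builder = new WriterBuilder();
            rowGroupSize.ifPresent(builder::withRowGroupSize);
            return builder;
        }
    }

    public static void main(String[] args) {
        // Without the option, the writer keeps its default row-group size.
        System.out.println(Sink.create().newWriterBuilder().rowGroupSize());
        // With the option, the configured size reaches the writer builder.
        System.out.println(
            Sink.create().withRowGroupSize(32 * 1024 * 1024).newWriterBuilder().rowGroupSize());
    }
}
```

Keeping the value optional on the sink means existing pipelines that never call the new method would behave exactly as before.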