Bashir Sadjad created BEAM-11969:
------------------------------------

             Summary: Make row-group size configurable in ParquetIO.Sink
                 Key: BEAM-11969
                 URL: https://issues.apache.org/jira/browse/BEAM-11969
             Project: Beam
          Issue Type: Improvement
          Components: io-java-parquet
            Reporter: Bashir Sadjad


ParquetIO.Sink does not appear to have an option for setting the row-group size. 
Its builder has a 
[withConfiguration|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1089]
 method, but it does not seem to change rowGroupSize in 
[ParquetWriter.Builder|https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636]
, so the default of 128 MB is always used. It should be fairly easy to add the 
plumbing for setting this option 
[here|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1112].
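To illustrate the kind of plumbing meant above, here is a minimal plain-Java sketch. It does not use Beam or parquet-mr: WriterBuilder and Sink are hypothetical stand-ins for ParquetWriter.Builder and ParquetIO.Sink, and the withRowGroupSize method on the sink is an assumed name for the proposed option, not an existing API. The idea is only that the sink remembers an optional row-group size and forwards it to the writer builder when one is set, otherwise leaving the 128 MB default in place.

```java
// Plain-Java stand-ins only: none of these classes are Beam or parquet-mr types.
public class RowGroupSizeSketch {

    /** Mirrors the 128 MB default mentioned in the issue. */
    static final int DEFAULT_ROW_GROUP_SIZE = 128 * 1024 * 1024;

    /** Hypothetical stand-in for the writer builder that owns rowGroupSize. */
    static final class WriterBuilder {
        private int rowGroupSize = DEFAULT_ROW_GROUP_SIZE;

        WriterBuilder withRowGroupSize(int bytes) {
            this.rowGroupSize = bytes;
            return this;
        }

        int rowGroupSize() {
            return rowGroupSize;
        }
    }

    /** Hypothetical stand-in for the sink; immutable, each with* call returns a copy. */
    static final class Sink {
        // null means the user never set a value, so the writer default applies.
        private final Integer rowGroupSize;

        private Sink(Integer rowGroupSize) {
            this.rowGroupSize = rowGroupSize;
        }

        static Sink create() {
            return new Sink(null);
        }

        /** Assumed name for the proposed option. */
        Sink withRowGroupSize(int bytes) {
            if (bytes <= 0) {
                throw new IllegalArgumentException("rowGroupSize must be positive, got " + bytes);
            }
            return new Sink(bytes);
        }

        /** The plumbing step: forward the value to the writer builder only if it was set. */
        WriterBuilder newWriterBuilder() {
            WriterBuilder builder = new WriterBuilder();
            if (rowGroupSize != null) {
                builder.withRowGroupSize(rowGroupSize);
            }
            return builder;
        }
    }

    public static void main(String[] args) {
        Sink defaultSink = Sink.create();
        Sink smallGroups = defaultSink.withRowGroupSize(32 * 1024 * 1024);

        System.out.println("default sink:    " + defaultSink.newWriterBuilder().rowGroupSize());
        System.out.println("configured sink: " + smallGroups.newWriterBuilder().rowGroupSize());
    }
}
```

Keeping the field nullable in the sketch means an unset option never overrides whatever default the writer itself uses, so existing pipelines would behave as before.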



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
