[
https://issues.apache.org/jira/browse/BEAM-11969?focusedWorklogId=566874&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-566874
]
ASF GitHub Bot logged work on BEAM-11969:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 16/Mar/21 11:26
Start Date: 16/Mar/21 11:26
Worklog Time Spent: 10m
Work Description: aromanenko-dev commented on a change in pull request #14227:
URL: https://github.com/apache/beam/pull/14227#discussion_r595074164
##########
File path: sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
##########
@@ -1054,6 +1054,7 @@ public static Sink sink(Schema schema) {
return new AutoValue_ParquetIO_Sink.Builder()
.setJsonSchema(schema.toString())
.setCompressionCodec(CompressionCodecName.SNAPPY)
+ .setRowGroupSize(0)
Review comment:
Would it make sense to set a default size here instead of just `0`?
##########
File path: sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
##########
@@ -1097,6 +1102,12 @@ public Sink withConfiguration(Configuration configuration) {
return toBuilder().setConfiguration(new SerializableConfiguration(configuration)).build();
}
+ /** Specify row-group size; if not set, a default will be used by the underlying writer. */
+ public Sink withRowGroupSize(int rowGroupSize) {
+ checkArgument(rowGroupSize > 0, "rowGroupSize should be positive");
Review comment:
nit: "**must** be"
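To make the two review comments above concrete, here is a minimal, self-contained sketch of one way to model an optional row-group size without a `0` sentinel. It is not the code in the pull request: `SinkConfigSketch`, `create()`, and `effectiveRowGroupSize()` are hypothetical names, and the 128MB fallback is taken from the issue description rather than read from the Parquet writer.

```java
import java.util.OptionalInt;

/** Hypothetical stand-in for a sink configuration; NOT the actual ParquetIO.Sink. */
final class SinkConfigSketch {
  /** Assumed writer default (128MB), as stated in the issue description. */
  static final int WRITER_DEFAULT_ROW_GROUP_SIZE = 128 * 1024 * 1024;

  /** Empty means "not set", so no magic value such as 0 is needed. */
  private final OptionalInt rowGroupSize;

  private SinkConfigSketch(OptionalInt rowGroupSize) {
    this.rowGroupSize = rowGroupSize;
  }

  /** Creates a configuration with no explicit row-group size. */
  static SinkConfigSketch create() {
    return new SinkConfigSketch(OptionalInt.empty());
  }

  /** Returns a copy with an explicit row-group size; rejects non-positive values. */
  SinkConfigSketch withRowGroupSize(int rowGroupSize) {
    if (rowGroupSize <= 0) {
      throw new IllegalArgumentException("rowGroupSize must be positive");
    }
    return new SinkConfigSketch(OptionalInt.of(rowGroupSize));
  }

  /** The explicit size if one was set, otherwise the assumed writer default. */
  int effectiveRowGroupSize() {
    return rowGroupSize.orElse(WRITER_DEFAULT_ROW_GROUP_SIZE);
  }
}
```

With this shape, `SinkConfigSketch.create().withRowGroupSize(0)` fails fast with the "must be positive" message, while an unset size falls through to the default at the point where the writer would be built.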
Issue Time Tracking
-------------------
Worklog Id: (was: 566874)
Time Spent: 20m (was: 10m)
> Make row-group size configurable in ParquetIO.Sink
> --------------------------------------------------
>
> Key: BEAM-11969
> URL: https://issues.apache.org/jira/browse/BEAM-11969
> Project: Beam
> Issue Type: Improvement
> Components: io-java-parquet
> Reporter: Bashir Sadjad
> Priority: P2
> Labels: easyfix
> Time Spent: 20m
> Remaining Estimate: 0h
>
> ParquetIO.Sink does not seem to have an option for setting the row-group size.
> Its builder has a
> [withConfiguration|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1089]
> method, but it does not seem to change rowGroupSize in
> [ParquetWriter.Builder|https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636],
> so the default of 128MB is used. It should be fairly easy to add the
> plumbing for setting this option
> [here|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1112].