bashir2 commented on a change in pull request #14227:
URL: https://github.com/apache/beam/pull/14227#discussion_r595409910
##########
File path:
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
##########
@@ -1054,6 +1054,7 @@ public static Sink sink(Schema schema) {
return new AutoValue_ParquetIO_Sink.Builder()
.setJsonSchema(schema.toString())
.setCompressionCodec(CompressionCodecName.SNAPPY)
+ .setRowGroupSize(0)
Review comment:
I agree with your point that `0` looks like magic here (how about adding
a one line comment?). But the problem with picking a default here is that I
feel this is something that needs to be left to the underlying writer. For
example, I am not sure if `ParquetWriter.DEFAULT_BLOCK_SIZE` is even the
"right" choice because there is also
[`ParquetWriter.DEFAULT_PAGE_SIZE`](https://github.com/apache/parquet-mr/blob/0efbfaa9a7b02bbd2a5912313ca77a9550da8ea1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L43)
which actually has a different value
[here](https://github.com/apache/parquet-mr/blob/616c3d681c5506ebad83a7d3724e85da2154eac9/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L47)
(i.e., 1MB vs 128MB for default *block* size).
It does seem to me that in some places `rowGroupSize` and `blockSize` are
used interchangeably in `ParquetWriter`. For example [this constructor
call](https://github.com/apache/parquet-mr/blob/0efbfaa9a7b02bbd2a5912313ca77a9550da8ea1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L229)
is passing its `blockSize` parameter as the `rowGroupSize` of the called
constructor. If you feel strongly about taking a stance here, it is okay with
me but because of these complexities in `ParquetWriter` I thought I just leave
the choice of the default to that. WDYT?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]