bashir2 commented on a change in pull request #14227:
URL: https://github.com/apache/beam/pull/14227#discussion_r595409910



##########
File path: 
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
##########
@@ -1054,6 +1054,7 @@ public static Sink sink(Schema schema) {
     return new AutoValue_ParquetIO_Sink.Builder()
         .setJsonSchema(schema.toString())
         .setCompressionCodec(CompressionCodecName.SNAPPY)
+        .setRowGroupSize(0)

Review comment:
       I agree with your point that `0` looks like magic here (how about adding 
a one line comment?). But the problem with picking a default here is that I 
feel this is something that needs to be left to the underlying writer. For 
example, I am not sure if `ParquetWriter.DEFAULT_BLOCK_SIZE` is even the 
"right" choice because there is also  
[`ParquetWriter.DEFAULT_PAGE_SIZE`](https://github.com/apache/parquet-mr/blob/0efbfaa9a7b02bbd2a5912313ca77a9550da8ea1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L43)
 which actually has a different value 
[here](https://github.com/apache/parquet-mr/blob/616c3d681c5506ebad83a7d3724e85da2154eac9/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L47)
 (i.e., 1MB vs 128MB for default *block* size).
   
   It does seem to me that in some places `rowGroupSize` and `blockSize` are 
used interchangeably in `ParquetWriter`. For example [this constructor 
call](https://github.com/apache/parquet-mr/blob/0efbfaa9a7b02bbd2a5912313ca77a9550da8ea1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L229)
 is passing its `blockSize` parameter as the `rowGroupSize` of the called 
constructor. If you feel strongly about taking a stance here, it is okay with 
me but because of these complexities in `ParquetWriter` I thought I just leave 
the choice of the default to that. WDYT?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to