[ 
https://issues.apache.org/jira/browse/BEAM-11969?focusedWorklogId=567309&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-567309
 ]

ASF GitHub Bot logged work on BEAM-11969:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Mar/21 22:00
            Start Date: 16/Mar/21 22:00
    Worklog Time Spent: 10m 
      Work Description: bashir2 commented on a change in pull request #14227:
URL: https://github.com/apache/beam/pull/14227#discussion_r595570559



##########
File path: 
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
##########
@@ -1054,6 +1054,7 @@ public static Sink sink(Schema schema) {
     return new AutoValue_ParquetIO_Sink.Builder()
         .setJsonSchema(schema.toString())
         .setCompressionCodec(CompressionCodecName.SNAPPY)
+        .setRowGroupSize(0)

Review comment:
       I thought a little more about this and decided to go with your original 
suggestion. I now think it is actually not a bad idea to expose a little of the 
complexity inside `ParquetWriter` here, to signal to the user that 
`rowGroupSize` is also used for the block-size setting (and there is a 
comment as well, so that should be fine).
   
   PTAL.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 567309)
    Time Spent: 1.5h  (was: 1h 20m)

> Make row-group size configurable in ParquetIO.Sink
> --------------------------------------------------
>
>                 Key: BEAM-11969
>                 URL: https://issues.apache.org/jira/browse/BEAM-11969
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-parquet
>            Reporter: Bashir Sadjad
>            Priority: P2
>              Labels: easyfix
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> ParquetIO.Sink does not appear to have an option for setting the row-group 
> size. Its builder has a 
> [withConfiguration|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1089]
>  method, but that does not seem to change rowGroupSize in 
> [ParquetWriter.Builder|https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636]
>  and hence the default of 128MB is used. It should be fairly easy to add the 
> plumbing for setting this option 
> [here|https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1112].
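
The plumbing described in the issue can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the actual Beam or Parquet code: `Sink`, `WriterBuilder`, `openWriter`, and `DEFAULT_ROW_GROUP_SIZE` are hypothetical stand-ins, and the only fact taken from the issue is that the writer falls back to 128MB when no row-group size is passed through.

```java
public class RowGroupSizeSketch {
    // Assumption: mirrors the 128MB default mentioned in the issue description.
    static final int DEFAULT_ROW_GROUP_SIZE = 128 * 1024 * 1024;

    // Hypothetical stand-in for a writer builder with a row-group size setting.
    static final class WriterBuilder {
        private int rowGroupSize = DEFAULT_ROW_GROUP_SIZE;

        WriterBuilder withRowGroupSize(int bytes) {
            this.rowGroupSize = bytes;
            return this;
        }

        int rowGroupSize() {
            return rowGroupSize;
        }
    }

    // Hypothetical stand-in for the sink: it stores the user's choice and
    // forwards it to the writer builder when a writer is opened.
    static final class Sink {
        private final int rowGroupSize;

        private Sink(int rowGroupSize) {
            this.rowGroupSize = rowGroupSize;
        }

        // Starts from the same default the writer would use on its own.
        static Sink create() {
            return new Sink(DEFAULT_ROW_GROUP_SIZE);
        }

        // Returns a new sink; the original stays unchanged (immutable style).
        Sink withRowGroupSize(int bytes) {
            if (bytes <= 0) {
                throw new IllegalArgumentException("rowGroupSize must be positive");
            }
            return new Sink(bytes);
        }

        // The "plumbing": pass the configured value on to the writer builder.
        WriterBuilder openWriter() {
            return new WriterBuilder().withRowGroupSize(rowGroupSize);
        }
    }

    public static void main(String[] args) {
        // Default sink: the writer sees 128MB.
        System.out.println(Sink.create().openWriter().rowGroupSize());
        // Configured sink: the writer sees the user-supplied 16MB.
        System.out.println(
            Sink.create().withRowGroupSize(16 * 1024 * 1024).openWriter().rowGroupSize());
    }
}
```

Giving the sink the same default as the writer, rather than a sentinel value, keeps the behaviour unchanged for users who never call the new setter.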



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
