bryanck commented on code in PR #5345:
URL: https://github.com/apache/iceberg/pull/5345#discussion_r930214751
##########
docs/configuration.md:
##########
@@ -46,45 +46,46 @@ Iceberg tables support table properties to configure table behavior, like the de
### Write properties
-| Property                           | Default            | Description                                        |
-| ---------------------------------- | ------------------ | -------------------------------------------------- |
-| write.format.default               | parquet            | Default file format for the table; parquet, avro, or orc |
-| write.delete.format.default        | data file format   | Default delete file format for the table; parquet, avro, or orc |
-| write.parquet.row-group-size-bytes | 134217728 (128 MB) | Parquet row group size |
-| write.parquet.page-size-bytes      | 1048576 (1 MB)     | Parquet page size |
-| write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size |
-| write.parquet.compression-codec    | gzip               | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |
-| write.parquet.compression-level    | null               | Parquet compression level |
-| write.parquet.bloom-filter-enabled.column.col1 | (not set) | Enables writing a bloom filter for the column: col1 |
-| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB)   | The maximum number of bytes for a bloom filter bitset |
-| write.avro.compression-codec       | gzip               | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed |
-| write.avro.compression-level       | null               | Avro compression level |
-| write.orc.stripe-size-bytes        | 67108864 (64 MB)   | Define the default ORC stripe size, in bytes |
-| write.orc.block-size-bytes         | 268435456 (256 MB) | Define the default file system block size for ORC files |
-| write.orc.compression-codec        | zlib               | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none |
-| write.orc.compression-strategy     | speed              | ORC compression strategy: speed, compression |
-| write.location-provider.impl       | null               | Optional custom implementation for LocationProvider |
-| write.metadata.compression-codec   | none               | Metadata compression codec; none or gzip |
-| write.metadata.metrics.default     | truncate(16)       | Default metrics mode for all columns in the table; none, counts, truncate(length), or full |
-| write.metadata.metrics.column.col1 | (not set)          | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full |
-| write.target-file-size-bytes       | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes |
-| write.delete.target-file-size-bytes| 67108864 (64 MB)   | Controls the size of delete files generated to target about this many bytes |
-| write.distribution-mode            | none               | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder |
-| write.delete.distribution-mode     | hash               | Defines distribution of write delete data |
-| write.wap.enabled                  | false              | Enables write-audit-publish writes |
-| write.summary.partition-limit      | 0                  | Includes partition-level summary stats in snapshot summaries if the changed partition count is less than this limit |
-| write.metadata.delete-after-commit.enabled | false      | Controls whether to delete the oldest version metadata files after commit |
-| write.metadata.previous-versions-max | 100              | The max number of previous version metadata files to keep before deleting after commit |
-| write.spark.fanout.enabled         | false              | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory |
-| write.object-storage.enabled       | false              | Enables the object storage location provider that adds a hash component to file paths |
-| write.data.path                    | table location + /data | Base location for data files |
-| write.metadata.path                | table location + /metadata | Base location for metadata files |
-| write.delete.mode                  | copy-on-write      | Mode used for delete commands: copy-on-write or merge-on-read (v2 only) |
-| write.delete.isolation-level       | serializable       | Isolation level for delete commands: serializable or snapshot |
-| write.update.mode                  | copy-on-write      | Mode used for update commands: copy-on-write or merge-on-read (v2 only) |
-| write.update.isolation-level       | serializable       | Isolation level for update commands: serializable or snapshot |
-| write.merge.mode                   | copy-on-write      | Mode used for merge commands: copy-on-write or merge-on-read (v2 only) |
-| write.merge.isolation-level        | serializable       | Isolation level for merge commands: serializable or snapshot |
+| Property                                       | Default                    | Description                                                                                        |
+|------------------------------------------------|----------------------------|----------------------------------------------------------------------------------------------------|
+| write.format.default                           | parquet                    | Default file format for the table; parquet, avro, or orc |
+| write.delete.format.default                    | data file format           | Default delete file format for the table; parquet, avro, or orc |
+| write.parquet.row-group-size-bytes             | 134217728 (128 MB)         | Parquet row group size |
+| write.parquet.page-size-bytes                  | 1048576 (1 MB)             | Parquet page size |
+| write.parquet.page-row-count-limit             | 20000                      | Parquet page row count limit |
+| write.parquet.dict-size-bytes                  | 2097152 (2 MB)             | Parquet dictionary page size |
+| write.parquet.compression-codec                | gzip                       | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |
+| write.parquet.compression-level                | null                       | Parquet compression level |
+| write.parquet.bloom-filter-enabled.column.col1 | (not set)                  | Enables writing a bloom filter for the column: col1 |
+| write.parquet.bloom-filter-max-bytes           | 1048576 (1 MB)             | The maximum number of bytes for a bloom filter bitset |
+| write.avro.compression-codec                   | gzip                       | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed |
+| write.avro.compression-level                   | null                       | Avro compression level |
+| write.orc.stripe-size-bytes                    | 67108864 (64 MB)           | Define the default ORC stripe size, in bytes |
+| write.orc.block-size-bytes                     | 268435456 (256 MB)         | Define the default file system block size for ORC files |
+| write.orc.compression-codec                    | zlib                       | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none |
+| write.orc.compression-strategy                 | speed                      | ORC compression strategy: speed, compression |
+| write.location-provider.impl                   | null                       | Optional custom implementation for LocationProvider |
+| write.metadata.compression-codec               | none                       | Metadata compression codec; none or gzip |
+| write.metadata.metrics.default                 | truncate(16)               | Default metrics mode for all columns in the table; none, counts, truncate(length), or full |
+| write.metadata.metrics.column.col1             | (not set)                  | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full |
+| write.target-file-size-bytes                   | 536870912 (512 MB)         | Controls the size of files generated to target about this many bytes |
+| write.delete.target-file-size-bytes            | 67108864 (64 MB)           | Controls the size of delete files generated to target about this many bytes |
+| write.distribution-mode                        | none                       | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder |
+| write.delete.distribution-mode                 | hash                       | Defines distribution of write delete data |
+| write.wap.enabled                              | false                      | Enables write-audit-publish writes |
+| write.summary.partition-limit                  | 0                          | Includes partition-level summary stats in snapshot summaries if the changed partition count is less than this limit |
+| write.metadata.delete-after-commit.enabled     | false                      | Controls whether to delete the oldest version metadata files after commit |
+| write.metadata.previous-versions-max           | 100                        | The max number of previous version metadata files to keep before deleting after commit |
+| write.spark.fanout.enabled                     | false                      | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory |
+| write.object-storage.enabled                   | false                      | Enables the object storage location provider that adds a hash component to file paths |
+| write.data.path                                | table location + /data     | Base location for data files |
+| write.metadata.path                            | table location + /metadata | Base location for metadata files |
+| write.delete.mode                              | copy-on-write              | Mode used for delete commands: copy-on-write or merge-on-read (v2 only) |
+| write.delete.isolation-level                   | serializable               | Isolation level for delete commands: serializable or snapshot |
+| write.update.mode                              | copy-on-write              | Mode used for update commands: copy-on-write or merge-on-read (v2 only) |
+| write.update.isolation-level                   | serializable               | Isolation level for update commands: serializable or snapshot |
+| write.merge.mode                               | copy-on-write              | Mode used for merge commands: copy-on-write or merge-on-read (v2 only) |
+| write.merge.isolation-level                    | serializable               | Isolation level for merge commands: serializable or snapshot |
Review Comment:
Sure, this is done (my editor got a little aggressive).
##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -143,6 +143,10 @@ private TableProperties() {
 public static final String DELETE_PARQUET_PAGE_SIZE_BYTES =
     "write.delete.parquet.page-size-bytes";
 public static final int PARQUET_PAGE_SIZE_BYTES_DEFAULT = 1024 * 1024; // 1 MB
+ public static final String PARQUET_PAGE_ROW_COUNT_LIMIT =
+     "write.parquet.page-row-count-limit";
Review Comment:
This is done
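For context, here is a minimal, self-contained sketch of how a table property such as the new `write.parquet.page-row-count-limit` is typically resolved against its compile-time default. The helper below mirrors the shape of Iceberg's `PropertyUtil.propertyAsInt` lookup, but the class name and the simplified parsing are assumptions for illustration, not Iceberg's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: resolves a write property from a table's
// property map, falling back to the documented default when unset.
public class PagePropertyDemo {
  // Name and default match the values added in this PR's diff.
  static final String PARQUET_PAGE_ROW_COUNT_LIMIT = "write.parquet.page-row-count-limit";
  static final int PARQUET_PAGE_ROW_COUNT_LIMIT_DEFAULT = 20_000;

  // Simplified stand-in for PropertyUtil.propertyAsInt(props, name, default).
  static int propertyAsInt(Map<String, String> props, String name, int defaultValue) {
    String value = props.get(name);
    return value != null ? Integer.parseInt(value) : defaultValue;
  }

  public static void main(String[] args) {
    Map<String, String> tableProps = new HashMap<>();

    // Unset: falls back to the documented default of 20000.
    System.out.println(propertyAsInt(tableProps, PARQUET_PAGE_ROW_COUNT_LIMIT,
        PARQUET_PAGE_ROW_COUNT_LIMIT_DEFAULT));

    // Explicitly set on the table (e.g. via ALTER TABLE ... SET TBLPROPERTIES).
    tableProps.put(PARQUET_PAGE_ROW_COUNT_LIMIT, "5000");
    System.out.println(propertyAsInt(tableProps, PARQUET_PAGE_ROW_COUNT_LIMIT,
        PARQUET_PAGE_ROW_COUNT_LIMIT_DEFAULT));
  }
}
```

This default-with-override pattern is why the docs table and the `TableProperties` constant must agree: readers tune the property per table, and the writer falls back to the constant when it is absent.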
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]