[GitHub] [iceberg] kbendick commented on a diff in pull request #5313: Orc: Support row group bloom filters

GitBox Tue, 16 Aug 2022 12:08:03 -0700


kbendick commented on code in PR #5313:
URL: https://github.com/apache/iceberg/pull/5313#discussion_r947158423



##########
docs/configuration.md:
##########
@@ -64,6 +64,8 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.orc.block-size-bytes         | 268435456 (256 MB) | Define the default file system block size for ORC files |
 | write.orc.compression-codec        | zlib               | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none |
 | write.orc.compression-strategy     | speed              | ORC compression strategy: speed, compression |
+| write.orc.bloom.filter.columns     | (not set)          | Comma separated list of column names for which a Bloom filter must be created |
+| write.orc.bloom.filter.fpp         | 0.05               | False positive probability for Bloom filter (must > 0.0 and < 1.0) |

Review Comment:
   You might want to match the parquet configurations a bit more closely.
   
   They are
   ```
   | write.parquet.bloom-filter-enabled.column.col1 | (not set)      | Enables writing a bloom filter for the column: col1 |
   | write.parquet.bloom-filter-max-bytes           | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
   ```
   
   So you could do `write.orc.bloom-filter-enabled.column.col1`. This also 
matches other config value formatting that's per column, such as 
`write.metadata.metrics.column.col1`.
   
   For the `fpp`, since that seems to be how the value is set on the ORC bloom 
filter, I would suggest keeping it as is. If the Parquet implementation 
translates max-bytes into an fpp, it might make sense to configure ORC the same 
way for consistency, but I doubt that it does.
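
To illustrate the per-column naming suggested above, here is a minimal sketch of how such keys could be collapsed back into the comma-separated column list that the PR's `write.orc.bloom.filter.columns` property carries. The prefix `write.orc.bloom-filter-enabled.column.` is only the reviewer's proposal, and the class and method names are hypothetical; none of this is taken from the Iceberg code base.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BloomFilterColumns {
  // Hypothetical prefix, modeled on write.parquet.bloom-filter-enabled.column.<name>.
  static final String PREFIX = "write.orc.bloom-filter-enabled.column.";

  // Returns the enabled column names joined with commas, in sorted key order.
  static String enabledColumns(Map<String, String> tableProperties) {
    // Copy into a TreeMap so the output order is deterministic.
    return new TreeMap<>(tableProperties).entrySet().stream()
        .filter(e -> e.getKey().startsWith(PREFIX))
        .filter(e -> Boolean.parseBoolean(e.getValue()))
        .map(e -> e.getKey().substring(PREFIX.length()))
        .collect(Collectors.joining(","));
  }
}
```

With properties that enable `id` and `ts` and disable `name`, `enabledColumns` would yield `id,ts`. Keys that do not start with the prefix are ignored, so the helper can be handed the full table property map.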



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


