chenjunjiedada commented on code in PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#discussion_r1206109480


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -519,8 +512,8 @@ static Context deleteContext(Map<String, String> config) {
             compressionLevel,
             rowGroupCheckMinRecordCount,
             rowGroupCheckMaxRecordCount,
-            bloomFilterMaxBytes,
-            columnBloomFilterEnabled,
+            PARQUET_BLOOM_FILTER_MAX_BYTES_DEFAULT,
+            ImmutableMap.of(),

Review Comment:
   > If some datasets have a high update rate and generate a lot of large delete files, would the bloom filter for delete files be useful too?
   
   First, we don't have read-side filter logic that uses the bloom filter right now. Second, high-rate updates mostly produce position deletes, which only carry `file_path` and `pos` columns, so there is no data column to build a bloom filter on, unless the upstream is deleting a set of records by value. Even in that case, the records of the filtered data files should always pass the equality delete's bloom filter anyway, so it would not prune anything.
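
   For context, on the data-file side, column bloom filters stay opt-in through table properties. A minimal sketch, assuming the property names from the Iceberg configuration docs; the `id` column and the 1 MiB cap are illustrative, not part of this PR:

   ```java
   import org.apache.iceberg.Table;

   class BloomFilterConfig {
     // Sketch: enable a Parquet bloom filter for one data-file column and
     // cap its size. "id" is a hypothetical column name.
     static void enableBloomFilter(Table table) {
       table.updateProperties()
           .set("write.parquet.bloom-filter-enabled.column.id", "true")
           .set("write.parquet.bloom-filter-max-bytes", "1048576") // 1 MiB
           .commit();
     }
   }
   ```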
   
   > Delete files are expected to be compacted/consolidated anyway. Hence the bloom filter on delete files never makes sense.
   
   I think so. Deletes impact read performance, and as far as I know, every real production deployment needs async compaction tasks to achieve good performance.
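
   To make that concrete, here is a minimal sketch of such an async maintenance task, assuming the `rewritePositionDeletes` Spark action available in recent Iceberg releases (not something this PR adds):

   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.spark.actions.SparkActions;
   import org.apache.spark.sql.SparkSession;

   class DeleteCompaction {
     // Sketch: compact small position delete files into larger ones so
     // reads stop paying the per-file overhead of many deletes.
     static void compactDeletes(SparkSession spark, Table table) {
       SparkActions.get(spark)
           .rewritePositionDeletes(table)
           .execute();
     }
   }
   ```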


