[
https://issues.apache.org/jira/browse/HUDI-6946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen updated HUDI-6946:
-----------------------------
Fix Version/s: 1.0.0
> Data Duplicates with range pruning while using hoodie.bloom.index.use.metadata
> ------------------------------------------------------------------------------
>
> Key: HUDI-6946
> URL: https://issues.apache.org/jira/browse/HUDI-6946
> Project: Apache Hudi
> Issue Type: Bug
> Components: metadata, writer-core
> Affects Versions: 0.13.1, 0.12.3, 0.14.0
> Reporter: Aditya Goenka
> Assignee: xi chaomin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0, 0.12.4, 0.14.1, 0.13.2
>
> Attachments: WX20231019-094414.png
>
>
> GitHub Issue -
> [https://github.com/apache/hudi/issues/9870]
>
> Code to Reproduce -
> ```
> import uuid
> from pyspark.sql.functions import desc
>
> COW_TABLE_NAME = "table_duplicates"
> PARTITION_FIELD = "year,month"
> PRECOMBINE_FIELD = "timestamp"
> COW_TABLE_LOCATION = "file:///tmp/issue_9870_" + str(uuid.uuid4())
>
> hudi_options_opt = {
>     "hoodie.table.name": COW_TABLE_NAME,
>     "hoodie.table.type": "COPY_ON_WRITE",
>     "hoodie.index.type": "BLOOM",
>     "hoodie.datasource.write.recordkey.field": "id",
>     "hoodie.datasource.write.partitionpath.field": PARTITION_FIELD,
>     "hoodie.datasource.write.precombine.field": PRECOMBINE_FIELD,
>     "hoodie.datasource.write.hive_style_partitioning": "true",
>     "hoodie.metadata.enable": "true",
>     "hoodie.bloom.index.use.metadata": "true",
>     "hoodie.metadata.index.column.stats.enable": "true",
>     "hoodie.parquet.small.file.limit": "-1",
> }
>
> # Initial insert of three records.
> inputDF = spark.createDataFrame(
>     [
>         ("1", "1", "1", 2020, 1),
>         ("2", "1", "1", 2020, 1),
>         ("3", "1", "1", 2020, 1),
>     ],
>     ["id", "value", "timestamp", "year", "month"],
> )
> (inputDF.write.format("org.apache.hudi")
>     .option("hoodie.datasource.write.operation", "upsert")
>     .options(**hudi_options_opt)
>     .mode("append")
>     .save(COW_TABLE_LOCATION))
>
> # Upsert an existing record key; with the bug, id=3 ends up duplicated.
> upsertDF = spark.createDataFrame(
>     [("3", "2", "1", 2020, 1)],
>     ["id", "value", "timestamp", "year", "month"],
> )
> (upsertDF.write.format("org.apache.hudi")
>     .option("hoodie.datasource.write.operation", "upsert")
>     .options(**hudi_options_opt)
>     .mode("append")
>     .save(COW_TABLE_LOCATION))
>
> # Each record key should appear exactly once per partition;
> # a count > 1 demonstrates the duplicate.
> (spark.read.format("org.apache.hudi")
>     .load(COW_TABLE_LOCATION)
>     .groupBy("year", "month", "_hoodie_record_key")
>     .count()
>     .orderBy(desc("count"))
>     .show(100, False))
> ```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)