[ 
https://issues.apache.org/jira/browse/HUDI-5839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kazdy updated HUDI-5839:
------------------------
    Description: 
There seems to be a bug with non-strict insert mode when no precombine field is 
defined (I have not checked the behavior when one is set).
When using the Spark datasource, it can insert duplicates only in overwrite 
mode, or in append mode when data is inserted into the table for the first 
time; if I insert in append mode a second time, it deduplicates the dataset as 
if it were working in upsert mode. Found on master (0.13.0).

This appears to be a regression, because I'm using this functionality in Hudi 
0.12.1.
{code:python}
from pyspark.sql.functions import expr

path = "/tmp/huditbl"

opt_insert = {
    'hoodie.table.name': 'huditbl',
    'hoodie.datasource.write.recordkey.field': 'keyid',
    'hoodie.datasource.write.table.name': 'huditbl',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.sql.insert.mode': 'non-strict',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.combine.before.upsert': 'false',
    'hoodie.combine.before.insert': 'false',
    'hoodie.datasource.write.insert.drop.duplicates': 'false'
}

df = spark.range(0, 10).toDF("keyid") \
  .withColumn("age", expr("keyid + 1000"))

df.write.format("hudi"). \
options(**opt_insert). \
mode("overwrite"). \
save(path)

spark.read.format("hudi").load(path).count() # returns 10

df = df.union(df) # creates duplicates
df.write.format("hudi"). \
options(**opt_insert). \
mode("append"). \
save(path)

spark.read.format("hudi").load(path).count() # returns 10 but should return 20 

# note
# this works:
df = df.union(df) # creates duplicates 
df.write.format("hudi"). \
options(**opt_insert). \
mode("overwrite"). \
save(path)

spark.read.format("hudi").load(path).count() # returns 20 as it should{code}
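One possible angle for triage (an assumption on my side, not verified against master): the second append may be routing records into existing file groups through the merge/small-file handle, where duplicate keys get collapsed by default. A sketch of the same repro options with the duplicate-on-insert flag forced on, to check whether the second append then keeps the duplicates:

```python
# Same options as the repro above, plus one extra flag.
# Assumption: hoodie.merge.allow.duplicate.on.inserts controls whether inserts
# into existing file groups may add duplicate keys; its relevance to this bug
# is a hypothesis, not a confirmed fix.
opt_insert_keep_dups = {
    'hoodie.table.name': 'huditbl',
    'hoodie.datasource.write.recordkey.field': 'keyid',
    'hoodie.datasource.write.table.name': 'huditbl',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.sql.insert.mode': 'non-strict',
    'hoodie.combine.before.insert': 'false',
    'hoodie.datasource.write.insert.drop.duplicates': 'false',
    # Hypothesis: allow inserts to write duplicate keys to existing file groups.
    'hoodie.merge.allow.duplicate.on.inserts': 'true',
}
```

If the second append returns 20 with this flag set, the regression would point at the default duplicate handling for inserts into existing file groups rather than at non-strict mode itself.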
 



> Insert in non-strict mode deduplicates dataset in "append" mode - spark
> -----------------------------------------------------------------------
>
>                 Key: HUDI-5839
>                 URL: https://issues.apache.org/jira/browse/HUDI-5839
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark, writer-core
>    Affects Versions: 0.13.0
>            Reporter: kazdy
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
