rishabhreply opened a new issue, #9992:
URL: https://github.com/apache/hudi/issues/9992

   **Describe the problem you faced**
   
   I am using write mode insert_overwrite. My partitions are based on the date 
extracted from the ingested filename (this is my partition key). The record 
key is all the columns in the file, as I don't have a unique-value column. I am 
not using a precombine key, as this mode does not require one.
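
   For reference, the key configuration looks roughly like the following (the column names here are placeholders, not the real file schema):

```python
# Hypothetical column names standing in for the real file schema; the
# actual job lists every column of the CSV as the record key.
record_key_columns = ["col_a", "col_b", "col_c"]

hudi_key_options = {
    # ComplexKeyGenerator concatenates all listed fields into one record key.
    "hoodie.datasource.write.recordkey.field": ",".join(record_key_columns),
    # Partition path field derived from the date in the ingested filename.
    "hoodie.datasource.write.partitionpath.field": "file_date",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
}
```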
   
   I ingested a file named 2021042300unck.csv. I received the result as 
expected, i.e., 10 records in the /20210423/ partition in the S3 bucket.
   I then created two exact copies of this file, changing only one digit in 
each filename:
   2021042301unck.csv
   2021042302unck.csv
   
   Using insert_overwrite, I expected 20 records to be available in the Hudi 
table. However, I am getting 40 records.
   
   My questions:
   1. Why am I seeing twice as many records (duplicates) as expected?
   2. Is there a way to de-duplicate the incoming data when using this mode? If 
not, which mode is best for such scenarios?
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   
   * Spark version : 3.1.1
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no, on AWS Glue
   
   Following Hudi options are set before writing:
               "hoodie.index.type": "BLOOM",  # Bloom indexes apply only to the write path, not reads
               "hoodie.metadata.enable": "true",
               "hoodie.metadata.index.bloom.filter.enable": "true",
               "hoodie.metadata.index.bloom.filter.parallelism": 100,
               "hoodie.bloom.index.use.metadata": "true",
               "hoodie.datasource.write.storage.type": storage_type,
               "hoodie.datasource.hive_sync.enable": "true",
               "hoodie.datasource.hive_sync.use_jdbc": "false",
               "hoodie.datasource.hive_sync.mode": "hms",
               "hoodie.table.name": table_name,
               "hoodie.datasource.write.table.name": table_name,
               "hoodie.datasource.hive_sync.table": table_name,
               "path": table_location,
               "hoodie.datasource.hive_sync.database": database_name,
               "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
               "hoodie.datasource.write.precombine.field": precombine_key,
               "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
