reenarosid opened a new issue #1885:
URL: https://github.com/apache/hudi/issues/1885


   
   Issue: I made a large insert into a Hudi table, but only about a tenth of the records
were inserted.
   In addition, the dataset is partitionless.
   I also made sure that de-duplication was set to false (I know it is false by
default; I set it explicitly just to be sure).
   Below is the set of commands that I executed.
   
   
   df = spark.read.parquet(PATH + "/*")
   # take 2000 records of the dataset
   df1 = df.limit(2000)
   # take 1000 of those, insert them first, then append the rest
   # (ensuring duplicates)
   set1 = df1.limit(1000)
     
   The first insert was set1; then I tried inserting df1 (a superset of set1).
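   For clarity, the overlap between the two batches can be sketched in plain
Python (the integer key ranges below are hypothetical stand-ins for the real
f1 values, assuming limit() returned the same rows both times):

   ```python
   # Hypothetical stand-ins for the record keys: df1 carries 2000 keys,
   # set1 = df1.limit(1000) is intended to be a strict subset of them.
   df1_keys = set(range(2000))
   set1_keys = set(range(1000))

   assert set1_keys < df1_keys          # set1 is a proper subset of df1
   overlap = len(df1_keys & set1_keys)  # keys re-sent by the second write
   print(overlap)                       # 1000 keys appear in both writes
   ```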
   
   hudi_options = {
     'hoodie.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.recordkey.field': 'f1',
     'hoodie.datasource.write.insert.drop.duplicates': 'false',
     'hoodie.datasource.write.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'f1',
     'hoodie.upsert.shuffle.parallelism': 1,
     'hoodie.insert.shuffle.parallelism': 1,
     'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
     'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',  # or 'MERGE_ON_READ'
     'hoodie.cleaner.commits.retained': '1',
     'hoodie.cleaner.fileversions.retained': '1',
     'hoodie.parquet.min.file.size': 6221225472,
   }
   
   
   set1.write.format("org.apache.hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(HUDI_PATH)
   
   ----------------------- second insertion -----
   hudi_options = {
     'hoodie.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.recordkey.field': 'f1',
     'hoodie.datasource.write.insert.drop.duplicates': 'false',
     'hoodie.datasource.write.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.operation': 'upsert',
     'hoodie.datasource.write.precombine.field': 'f2',
     'hoodie.upsert.shuffle.parallelism': 1,
     'hoodie.insert.shuffle.parallelism': 1,
     'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
     'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',  # or 'MERGE_ON_READ'
     'hoodie.cleaner.commits.retained': '1',
     'hoodie.cleaner.fileversions.retained': '1',
     'hoodie.parquet.min.file.size': 6221225472,
   }
   
   df1.write.format("org.apache.hudi"). \
     options(**hudi_options). \
     mode("append"). \
     save(HUDI_PATH)
   
   
   But when I look at the count, I see that only some of the records were
inserted (1043 instead of 3000 in my case).
   Field f1 has duplicate values in my data source.
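   A minimal sketch (plain Python, no Spark required) of why duplicated record
keys can collapse the count: with 'hoodie.datasource.write.recordkey.field'
set to 'f1', an upsert keeps at most one row per f1 value, much like writing
rows into a dict keyed by f1. This illustrates the dedup-by-key effect only;
it is not Hudi's actual code path, and the toy rows are invented:

   ```python
   # Toy rows with a duplicated record key 'f1'
   rows = [
       {"f1": "a", "f2": 1},
       {"f1": "a", "f2": 2},  # same key as the first row
       {"f1": "b", "f2": 7},
   ]

   table = {}
   for r in rows:
       # an upsert on an already-present key replaces the existing row
       # (Hudi uses the precombine field to pick the winner)
       table[r["f1"]] = r

   print(len(rows), len(table))  # 3 input rows collapse to 2 table rows
   ```

   Under this reading, a table count near the number of distinct f1 values
(rather than the number of rows written) is the expected outcome.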


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]