jardel-lima opened a new issue #3879:
URL: https://github.com/apache/hudi/issues/3879


   
   **Describe the problem you faced**
   
   I am trying to migrate some tables to Hudi format, but I am running into an 
issue. We have a 7 GB (Snappy-compressed) table with 200M rows, 49 columns, and 
just one partition. Using the PySpark DataSource, the migration finished without 
any error, but I noticed that about 10,000 rows were missing from the Hudi table. 
I migrated the table again, and the same issue occurred. 
   
   **To Reproduce**
   
   Migrate a huge table with a single partition using bulk_insert operation.
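
A write along these lines should reproduce the setup. This is a minimal sketch, not the exact job: the function name `migrate`, the `base_path` argument, and the `overwrite` save mode are assumptions, and `hudi_options` stands for the dict listed under Hudi Options.

```python
def migrate(df, base_path, hudi_options):
    """Write a Spark DataFrame to base_path as a Hudi table.

    df           -- source Spark DataFrame (the table being migrated)
    base_path    -- target location of the Hudi table (hypothetical, e.g. an S3 URI)
    hudi_options -- dict of Hudi write options, including the bulk_insert operation
    """
    (
        df.write.format("hudi")
        .options(**hudi_options)
        .mode("overwrite")  # assumption: the target is rewritten on every try
        .save(base_path)
    )
```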
   
   **Expected behavior**
   
   I expected all rows to be migrated.
   
   **Environment Description**
   
   * Hudi version : 0.9
   
   * Spark version : 3.0.0
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   - We were able to migrate other tables without any issue;
   - The record key used is unique throughout the table;
   - Different records were missing on each try. For example, record 
A was migrated in the first try but not in the second;
   - The same number of records was missing in both tries;
   - I cleaned up all data before each try.
   
   Hudi Options:
   ```
   {
    'hoodie.table.name': 'table_a',
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.datasource.write.recordkey.field': 'KEY',
    'hoodie.datasource.write.partitionpath.field': 'YEAR',
    'hoodie.datasource.write.precombine.field': 'REF_DATE',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'database_a',
    'hoodie.datasource.hive_sync.table': 'table_a',
    'hoodie.datasource.hive_sync.partition_fields': 'YEAR',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.bulkinsert.shuffle.parallelism': 17,
    'hoodie.cleaner.commits.retained': 3,
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
   }
   ```
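
One way to narrow down the discrepancy is to compare the source and the Hudi table by record key. The following is a minimal sketch: the helper names (`duplicate_keys`, `missing_keys`, `reconcile`) are hypothetical, both arguments are assumed to be Spark DataFrames, and `KEY` is the record key field from the options above.

```python
def duplicate_keys(df, key_col):
    """Return the keys that occur more than once in df."""
    # groupBy().count() adds a column named "count"; keep keys seen more than once.
    return df.groupBy(key_col).count().where("count > 1")


def missing_keys(source_df, hudi_df, key_col):
    """Return the source keys that have no matching key in the Hudi table."""
    # A left anti join keeps only the left-side rows without a match on the right.
    return source_df.select(key_col).join(
        hudi_df.select(key_col), on=key_col, how="left_anti"
    )


def reconcile(source_df, hudi_df, key_col="KEY"):
    """Summarize row counts, duplicate source keys, and keys absent from Hudi."""
    return {
        "source_rows": source_df.count(),
        "hudi_rows": hudi_df.count(),
        "duplicate_keys": duplicate_keys(source_df, key_col).count(),
        "missing_keys": missing_keys(source_df, hudi_df, key_col).count(),
    }
```

If `duplicate_keys` is non-zero, the key is not actually unique in the source. If `missing_keys` is non-zero while `duplicate_keys` is zero, the rows returned by `missing_keys` identify exactly which records were dropped in that run.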



