Re: [I] UPSERTs are taking time [hudi]

via GitHub Fri, 10 Nov 2023 06:31:25 -0800


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1805844791


   I am trying to initialize a new table with RLI. I need to load the history 
first, which has 3210407531 records and 520 GB of data.
   
   The Spark context shuts down while loading this much data. The number of 
objects is also huge, as shown in the screenshot below:
   <img width="831" alt="image" 
src="https://github.com/apache/hudi/assets/109939327/34b21e41-5631-498a-a15f-d3d71ba728c3">
   
   **Hoodie config:**
       "className": "org.apache.hudi",
       "hoodie.table.name": tgt_tbl,
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "eff_fm_cent_tz",
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.partitionpath.field": "year,month",
       "hoodie.datasource.hive_sync.support_timestamp": "true",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.assume_date_partitioning": "false",
       "hoodie.datasource.hive_sync.table": tgt_tbl,
       "hoodie.datasource.hive_sync.use_jdbc": "false",
       "hoodie.datasource.hive_sync.mode": "hms",
       "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.bulkinsert.shuffle.parallelism": hudi_Insert_parallelism,
       "hoodie.index.type": "RECORD_INDEX",
       "hoodie.metadata.record.index.enable":"true",
       "hoodie.metadata.enable": "true"   
   
   Error:
   <img width="1789" alt="image" 
src="https://github.com/apache/hudi/assets/109939327/ec7c203d-b720-427a-91e0-c4e7a43615a0">
   
   Can you suggest which parameters should be used to load this data? I need to 
load the history first before starting the deltas.
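
The config above sets `hoodie.bulkinsert.shuffle.parallelism`, but the write operation is `upsert`, so that setting is not the one governing this write. A minimal sketch of one option that could be tried for the one-time history load, assuming `bulk_insert` is acceptable for the initial write: derive a second set of options from the same config with the operation switched. The helper name `initial_load_options`, the table name `my_table`, and the parallelism value are hypothetical, and whether this avoids the Spark context shutdown shown above has not been verified here.

```python
from typing import Dict


def initial_load_options(base: Dict[str, str], parallelism: int) -> Dict[str, str]:
    """Return a copy of the Hudi options adjusted for a one-time history load.

    Hypothetical helper: it only switches the write operation to bulk_insert
    and sets the bulk-insert shuffle parallelism; all other keys are kept.
    """
    opts = dict(base)  # copy, so the options used for later deltas stay untouched
    opts["hoodie.datasource.write.operation"] = "bulk_insert"
    opts["hoodie.bulkinsert.shuffle.parallelism"] = str(parallelism)
    return opts


# Subset of the config from this comment; the table name is a placeholder.
base_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "eff_fm_cent_tz",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.partitionpath.field": "year,month",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.metadata.enable": "true",
}

# The parallelism value is an assumption to be tuned for the cluster.
history_options = initial_load_options(base_options, parallelism=2000)
```

The history DataFrame would then be written with `history_options`, and the delta runs would keep using `base_options` with `upsert`.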


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.