[GitHub] [hudi] Kavin88 edited a comment on issue #3831: Deltastreamer through Pyspark/livy

GitBox Tue, 09 Nov 2021 00:41:30 -0800


Kavin88 edited a comment on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-948232383



   @xushiyan 1. As of now, I run DeltaStreamer with a direct spark-submit on 
the EMR cluster. I want to understand whether DeltaStreamer can be used the 
same way as the Hudi datasource writer. The params we pass to the datasource 
writer in PySpark are given below. I can't work out how to pass the 
DeltaStreamer params in Python/Spark code or through a Livy submit. In 
particular, I can't find how to pass --continuous, the source class name, the 
source ordering field, etc. in the hudiOptions below. Is this viable?
   
   hudiOptions = {
       "hoodie.table.name": "hudi_test",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "last_update_time",
       "hoodie.upsert.shuffle.parallelism": 1,
       "hoodie.insert.shuffle.parallelism": 1,
       "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   }
   
   
   inputdf.write.format("org.apache.hudi") \
       .option("hoodie.datasource.write.operation", "insert") \
       .options(**hudiOptions) \
       .mode("overwrite") \
       .save("storagepath")
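
   For comparison, here is a rough sketch of what a Livy submit might look 
like. DeltaStreamer is a standalone Spark application (main class 
HoodieDeltaStreamer in the hudi-utilities bundle), so flags such as 
--continuous, --source-class and --source-ordering-field are application 
arguments rather than datasource options. Livy's POST /batches endpoint accepts 
a JSON body with file, className and args, which is roughly the REST equivalent 
of spark-submit. The jar path, properties file, base path and Kafka source 
class below are hypothetical placeholders, and the helper 
build_deltastreamer_batch is made up for illustration. The sketch only builds 
the payload and does not send it.

```python
import json

# Main class shipped in the hudi-utilities bundle jar.
DELTASTREAMER_CLASS = "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer"


def build_deltastreamer_batch(jar_path, props_path, base_path, table_name,
                              ordering_field, source_class, continuous=True):
    """Build the JSON body for a Livy POST /batches request (nothing is sent)."""
    # DeltaStreamer CLI flags are passed as application args, not as hoodie.* options.
    args = [
        "--table-type", "MERGE_ON_READ",
        "--source-class", source_class,
        "--source-ordering-field", ordering_field,
        "--target-base-path", base_path,
        "--target-table", table_name,
        "--props", props_path,
    ]
    if continuous:
        args.append("--continuous")
    return {"file": jar_path, "className": DELTASTREAMER_CLASS, "args": args}


# All paths and the source class here are hypothetical example values.
payload = build_deltastreamer_batch(
    jar_path="s3://my-bucket/jars/hudi-utilities-bundle.jar",
    props_path="s3://my-bucket/config/hudi_test.properties",
    base_path="s3://my-bucket/hudi/hudi_test",
    table_name="hudi_test",
    ordering_field="last_update_time",
    source_class="org.apache.hudi.utilities.sources.JsonKafkaSource",
)
print(json.dumps(payload, indent=2))
```

   The record key and other hoodie.* settings from hudiOptions would then 
presumably move into the properties file passed via --props, and the printed 
JSON would be posted to the Livy server's /batches endpoint.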
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
