vinothchandar commented on pull request #3964:
URL: https://github.com/apache/hudi/pull/3964#issuecomment-989335863
@leesf Love to understand the plan going forward here and how we plan to
migrate the existing v1 write path onto the v2 APIs. Specifically, the current
v1 upsert pipeline consists of the following logical stages, `preCombine ->
index -> partition -> write`, before committing out the files. In other words,
we benefit from the v1 API giving us ways to shuffle the dataframe further
before writing to disk, and IIUC v2 takes this flexibility away?
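For illustration, here is a minimal pure-Python sketch of those four logical stages; the record fields (`key`, `ts`, `partition`) and the in-memory shapes are hypothetical stand-ins, not Hudi's actual internals:

```python
# Sketch of the v1 upsert pipeline's logical stages:
# preCombine -> index -> partition -> write (field names are assumptions).

def pre_combine(records):
    # Deduplicate by key, keeping the record with the highest ordering value.
    latest = {}
    for rec in records:
        key = rec["key"]
        if key not in latest or rec["ts"] > latest[key]["ts"]:
            latest[key] = rec
    return list(latest.values())

def index(records, existing_keys):
    # Tag each record as an update (key already on storage) or an insert.
    return [dict(rec, is_update=rec["key"] in existing_keys) for rec in records]

def partition(records):
    # Shuffle records into per-partition buckets before the physical write.
    buckets = {}
    for rec in records:
        buckets.setdefault(rec["partition"], []).append(rec)
    return buckets

def write(buckets):
    # Stand-in for the physical write of each bucket to its file group.
    return {part: len(recs) for part, recs in buckets.items()}

records = [
    {"key": "a", "ts": 1, "partition": "2021/12/08"},
    {"key": "a", "ts": 2, "partition": "2021/12/08"},
    {"key": "b", "ts": 1, "partition": "2021/12/09"},
]
stats = write(partition(index(pre_combine(records), existing_keys={"b"})))
print(stats)  # {'2021/12/08': 1, '2021/12/09': 1}
```

The point is that the `index` and `partition` steps reshuffle the incoming data between the user's `save()` call and the physical write, which is the flexibility in question.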
Assuming I am correct (and Spark has not introduced any new APIs that help
us mitigate this), should we do the following?
- Introduce a new `hudiv2` datasource, i.e. `spark.write.format("hudiv2")`,
that supports only bulk_insert on the datasource write path.
- We also add a new `SparkDatasetWriteClient`, which exposes methods for
upsert, delete, ... and we use that as the basis for our SQL/DML layer as well.
- We continue to support the v1 `hudi` datasource as-is for some time. There
are lots of users who like how they can do upserts/deletes by executing
`spark.write.format("hudi").option()...`
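To make the proposed split concrete, a hypothetical sketch of the two surfaces; every name below (the `hudiv2` format string, `SparkDatasetWriteClient` and its methods) comes from this comment's proposal, not from any existing Hudi release:

```python
# Hypothetical sketch of the proposed split (names are assumptions
# from this proposal, not an existing Hudi API).

class SparkDatasetWriteClient:
    """Proposed client backing the SQL/DML layer with upsert, delete, ..."""

    def upsert(self, records, instant_time):
        raise NotImplementedError  # would shuffle/index/write via Datasets

    def delete(self, keys, instant_time):
        raise NotImplementedError

# Proposed v2 datasource path, bulk_insert only, e.g.:
#   df.write.format("hudiv2").save(base_path)
# Existing v1 datasource path, kept as-is for upserts/deletes, e.g.:
#   df.write.format("hudi").option(...).save(base_path)
```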
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]