vinothchandar commented on pull request #3964:
URL: https://github.com/apache/hudi/pull/3964#issuecomment-989335863
@leesf Love to understand the plan going forward here and how we plan to
migrate the existing v1 write path onto the v2 APIs. Specifically, the current
v1 upsert pipeline consists of the following logical stages, `preCombine ->
index -> partition -> write`, before committing out the files. In other words,
we benefit from the v1 API giving us ways to shuffle the dataframe further
before writing to disk, and IIUC v2 takes this flexibility away?
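For illustration, here is a minimal pure-Python sketch of those four logical stages; the record fields (`key`, `ts`, `partition`) and the in-memory shapes are hypothetical stand-ins, not Hudi's actual internals:

```python
# Sketch of the v1 upsert pipeline's logical stages:
# preCombine -> index -> partition -> write (field names are assumptions).

def pre_combine(records):
    # Deduplicate by key, keeping the record with the highest ordering value.
    latest = {}
    for rec in records:
        key = rec["key"]
        if key not in latest or rec["ts"] > latest[key]["ts"]:
            latest[key] = rec
    return list(latest.values())

def index(records, existing_keys):
    # Tag each record as an update (key already on storage) or an insert.
    return [dict(rec, is_update=rec["key"] in existing_keys) for rec in records]

def partition(records):
    # Shuffle records into per-partition buckets before the physical write.
    buckets = {}
    for rec in records:
        buckets.setdefault(rec["partition"], []).append(rec)
    return buckets

def write(buckets):
    # Stand-in for the physical write of each bucket to its file group.
    return {part: len(recs) for part, recs in buckets.items()}

records = [
    {"key": "a", "ts": 1, "partition": "2021/12/08"},
    {"key": "a", "ts": 2, "partition": "2021/12/08"},
    {"key": "b", "ts": 1, "partition": "2021/12/09"},
]
stats = write(partition(index(pre_combine(records), existing_keys={"b"})))
print(stats)  # {'2021/12/08': 1, '2021/12/09': 1}
```

The point is that the `index` and `partition` steps reshuffle the incoming data between the user's `save()` call and the physical write, which is the flexibility in question.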
Assuming I am correct (and Spark has not introduced any new APIs that help
us mitigate this), should we do the following?
- Introduce a new `hudiv2` datasource, i.e. `spark.write.format("hudiv2")`,
that supports only bulk_insert on the datasource write path.
- We also add a new `SparkDatasetWriteClient`, which exposes methods for
upsert, delete, ... and we use that as the basis for our SQL/DML layer as well.
- We continue to support the v1 `hudi` datasource as-is for some time. There
are lots of users who like how they can do upserts/deletes by executing
`spark.write.format("hudi").option()...`
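To make the proposed split concrete, a hypothetical sketch of the two surfaces; every name below (the `hudiv2` format string, `SparkDatasetWriteClient` and its methods) comes from this comment's proposal, not from any existing Hudi release:

```python
# Hypothetical sketch of the proposed split (names are assumptions
# from this proposal, not an existing Hudi API).

class SparkDatasetWriteClient:
    """Proposed client backing the SQL/DML layer with upsert, delete, ..."""

    def upsert(self, records, instant_time):
        raise NotImplementedError  # would shuffle/index/write via Datasets

    def delete(self, keys, instant_time):
        raise NotImplementedError

# Proposed v2 datasource path, bulk_insert only, e.g.:
#   df.write.format("hudiv2").save(base_path)
# Existing v1 datasource path, kept as-is for upserts/deletes, e.g.:
#   df.write.format("hudi").option(...).save(base_path)
```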
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]