RajasekarSribalan opened a new issue #2238: URL: https://github.com/apache/hudi/issues/2238
Hi all, I have a question regarding CDC using Hudi. I am using the Spark DataSource API for upserts and deletes on Hudi. What is the best way of doing deletes in Hudi?

Our current flow is: read from Kafka -> persist the DataFrame in memory -> filter upserts -> write to Hudi -> filter deletes -> write to Hudi. Is this the right way of handling both upserts and deletes from an incoming stream? The problem with this approach is that Hudi does indexing twice for a single batch of records, since we perform the upsert and the delete as separate writes. I would like to have your suggestions for improving our pipeline.

Can we use `_hoodie_is_deleted` with the Spark DataSource API? We could append a `_hoodie_is_deleted` column, set to true for delete records and false for insert/update records, and issue a single write. If we use `_hoodie_is_deleted`, will Hudi hard-delete the row, or does it set it to null? Please confirm.
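A minimal sketch of the tagging step described above, using plain Python dicts to stand in for the DataFrame transformation. The `op` field and its `"d"` (delete) code are assumptions about the incoming CDC payload, not part of Hudi itself:

```python
# Sketch: tag each CDC record with a _hoodie_is_deleted flag so upserts
# and deletes can go to Hudi in ONE write (one indexing pass), instead
# of two separate writes.

def tag_batch(records):
    """Append _hoodie_is_deleted to every record in a CDC batch."""
    tagged = []
    for rec in records:
        out = dict(rec)
        # True for deletes, False for inserts/updates.
        out["_hoodie_is_deleted"] = rec.get("op") == "d"
        out.pop("op", None)  # drop the CDC op code before writing
        tagged.append(out)
    return tagged

batch = [
    {"id": 1, "name": "a", "op": "u"},  # update
    {"id": 2, "name": "b", "op": "d"},  # delete
]
tagged = tag_batch(batch)
```

In Spark this would roughly correspond to `df.withColumn("_hoodie_is_deleted", col("op") == "d")` followed by a single Hudi upsert write, rather than two filtered writes.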
