RajasekarSribalan opened a new issue #2238: URL: https://github.com/apache/hudi/issues/2238
Hi all, I have a question regarding CDC using Hudi. I am using the Spark DataSource API for upserts and deletes on Hudi. What is the best way of doing deletes in Hudi?

Our current flow is: read from Kafka -> persist the DataFrame in memory -> filter upserts -> write to Hudi -> filter deletes -> write to Hudi. Is this the right way of handling both upserts and deletes from an incoming stream? The problem with this approach is that Hudi does indexing twice for a single batch of records, since we perform the upsert and the delete as separate writes. I would like to have your suggestions for improving our pipeline.

Can we use `_hoodie_is_deleted` with the Spark DataSource API? We could append a `_hoodie_is_deleted` column, set to true for delete records and false for insert/update records, and issue a single write. If we use `_hoodie_is_deleted`, will Hudi hard-delete the row, or does it set it to null? Please confirm.
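A minimal sketch of the tagging step described above, using plain Python dicts to stand in for the DataFrame transformation. The `op` field and its `"d"` (delete) code are assumptions about the incoming CDC payload, not part of Hudi itself:

```python
# Sketch: tag each CDC record with a _hoodie_is_deleted flag so upserts
# and deletes can go to Hudi in ONE write (one indexing pass), instead
# of two separate writes.

def tag_batch(records):
    """Append _hoodie_is_deleted to every record in a CDC batch."""
    tagged = []
    for rec in records:
        out = dict(rec)
        # True for deletes, False for inserts/updates.
        out["_hoodie_is_deleted"] = rec.get("op") == "d"
        out.pop("op", None)  # drop the CDC op code before writing
        tagged.append(out)
    return tagged

batch = [
    {"id": 1, "name": "a", "op": "u"},  # update
    {"id": 2, "name": "b", "op": "d"},  # delete
]
tagged = tag_batch(batch)
```

In Spark this would roughly correspond to `df.withColumn("_hoodie_is_deleted", col("op") == "d")` followed by a single Hudi upsert write, rather than two filtered writes.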
