sydneyhoran commented on issue #9143:
URL: https://github.com/apache/hudi/issues/9143#issuecomment-1632896844

   Hey @ad1happy2go @danny0405 @soumilshah1995 unfortunately, with the latest 
master branch, the no-op deletes are still happening and the job is not 
actually processing our deletes. However, it no longer throws DeltaSync 
global errors for "Error for key:HoodieKey", so the `commit-on-errors` flag 
doesn't appear to be needed with the latest master.
   
   Test steps: insert records and confirm they appear in the S3/Redshift 
table. We receive 3 Kafka messages for each record: a create, an update, and 
then a delete. The first 2 come in with all details populated in the `"after"` 
field. The 3rd one comes with only the ID populated in the `"before"` field 
and `inserted_at` as null, so Deltastreamer looks for the record in the 
1970/01/01 partition instead of the correct partition.
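
   A minimal sketch of the suspected behavior, assuming a null timestamp falls 
back to epoch 0. This is not Hudi's key generator code: `partition_path` and 
`row_of` are hypothetical helpers, and the message values are made up to 
mirror the shape described above.

```python
from datetime import datetime, timezone
from typing import Optional


def partition_path(inserted_at_micros: Optional[int]) -> str:
    """Hypothetical stand-in for a timestamp-based partition path."""
    # Assumption: a missing timestamp is treated as epoch 0.
    micros = inserted_at_micros if inserted_at_micros is not None else 0
    ts = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
    return ts.strftime("%Y/%m/%d")


def row_of(msg: dict) -> dict:
    """Pick the populated side of the change event."""
    return msg["after"] if msg["after"] is not None else msg["before"]


# Change events reduced to the fields discussed above (illustrative values).
create_msg = {
    "op": "c",
    "before": None,
    "after": {"id": 1, "inserted_at": 1689076800000000},
}
delete_msg = {
    "op": "d",
    "before": {"id": 1, "inserted_at": None},
    "after": None,
}

for msg in (create_msg, delete_msg):
    print(msg["op"], partition_path(row_of(msg)["inserted_at"]))
```

   Under that assumption the create resolves to `2023/07/11` while the delete 
resolves to `1970/01/01`, which matches what the .commit files show.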
   
   This is what the first 2 Kafka messages look like for record insertion `c` & 
then update `u`:
   
   <img width="1435" alt="Screenshot 2023-07-11 at 9 24 06 PM" 
src="https://github.com/apache/hudi/assets/105753252/558610b1-d464-4b21-8878-5593af89c24a">
   
   
   And then the `u` is followed by a `d` on the record we were testing:
   
   <img width="1420" alt="Screenshot 2023-07-11 at 9 24 18 PM" 
src="https://github.com/apache/hudi/assets/105753252/5c96380e-a12c-4f63-99b2-4d9265402d60">
   
   
   Based on the hoodie.commit file, we see that 5 records were to be deleted 
from the 1970/01/01 partition (this test was with 5 inserts/deletes):
   
   <img width="905" alt="Screenshot 2023-07-11 at 9 18 41 PM" 
src="https://github.com/apache/hudi/assets/105753252/a3f4f0f0-55c7-4a07-85f5-11d8753e2a8e">
   
   
   In this .commit file from another test, you can see that the records were 
inserted into the current `2023/07/11` partition but the delete operation was 
on the `1970/01/01` partition:
   
   <img width="901" alt="Screenshot 2023-07-12 at 12 53 44 PM" 
src="https://github.com/apache/hudi/assets/105753252/72f0e73d-cd95-4d87-9408-8f1a9de3d32d">
   
   
   The records are still present in Redshift after the `"d"` operation:
   
   <img width="759" alt="Screenshot 2023-07-12 at 12 10 41 PM" 
src="https://github.com/apache/hudi/assets/105753252/1df35d99-ff3d-4d07-ae60-3ef9d06895ff">
   
   
   One more thing I can think of is that we use the timestamp-based key 
generator with `hoodie.deltastreamer.keygen.timebased.timestamp.type` = 
EPOCHMICROSECONDS. We used a fresh master jar for this test, with one 
exception: the only custom change was adding the EPOCHMICROSECONDS timestamp 
type to `TimestampBasedAvroKeyGenerator.java`. I'm wondering if the delete 
logic for finding a record with no timestamp attached somehow defaults to 
milliseconds? But that wouldn't make much sense, as the partition field is 
`null` in the incoming record...
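
   A rough check of that reasoning, as a sketch only: `partition_from` is a 
hypothetical helper, not part of Hudi, and it assumes the partition is 
derived directly from an epoch value in the given unit. A microsecond value 
misread as milliseconds lands far in the future, whereas a null treated as 0 
gives 1970/01/01 under either unit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_from(value: int, unit: str) -> Optional[str]:
    """Hypothetical helper: format an epoch value as yyyy/MM/dd.

    Returns None when the result is beyond what datetime can represent.
    """
    if unit == "micros":
        delta = timedelta(microseconds=value)
    elif unit == "millis":
        delta = timedelta(milliseconds=value)
    else:
        raise ValueError(f"unknown unit: {unit}")
    try:
        return (EPOCH + delta).strftime("%Y/%m/%d")
    except OverflowError:
        # The date is past year 9999, i.e. nowhere near 1970.
        return None


inserted_at = 1689076800000000  # illustrative value in epoch microseconds

print(partition_from(inserted_at, "micros"))  # read with the right unit
print(partition_from(inserted_at, "millis"))  # unit mix-up: out of range
print(partition_from(0, "micros"))            # null treated as 0
print(partition_from(0, "millis"))            # same result in either unit
```

   So a unit mix-up alone would not explain the 1970/01/01 partition; the 
null `inserted_at` on the delete message is the more likely cause.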
   
   How can Deltastreamer find out which partition to delete a record from if 
it's not told the partition timestamp?
   
   Any thoughts on this?


