joaqs190 opened a new issue #1803:
URL: https://github.com/apache/hudi/issues/1803


   
   **Describe the problem you faced**
   
   Hi Hudi team! 
   
   My use case relies on `hoodie.datasource.write.precombine.field`. The records have a composite key, and there are often multiple records with the same key and the same timestamp; the precombine field is meant to break those ties.
   
   In tests with 0.5.2 and 0.6.0 the precombine field is not taken into account, and the record ends up holding an intermediate update rather than the latest one. See the example below.
   
   Example:
   
   Output of the records in S3 generated by AWS DMS:
   
   ```
   Record 1:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:47.000000",
   "items": 61
   
   Record 2:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:48.000000",
   "items": 62
   
   Record 3:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:52.000000",
   "items": 63
   
   Record 4:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:52.000000",
   "items": 64
   
   Record 5:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:52.000000",
   "items": 65
   ```
   
   Inspecting the Hudi DeltaStreamer output from within Spark, Record 3 ("items" = 63) was written to the dataset, but not Record 5 ("items" = 65).
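   
   For reference, this is the deduplication I expected, expressed as a plain Spark job (a minimal sketch; the column names `id` and `seq` are hypothetical stand-ins for my record key and precombine columns):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.row_number
   
   object PrecombineExpectation {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("precombine-expectation")
         .master("local[*]")
         .getOrCreate()
       import spark.implicits._
   
       // Records 3-5 from the example: same key, same timestamp,
       // increasing precombine value ("seq", the unique number).
       val updates = Seq(
         ("key-1", "2020-07-06 18:57:52.000000", 63L, 63),
         ("key-1", "2020-07-06 18:57:52.000000", 64L, 64),
         ("key-1", "2020-07-06 18:57:52.000000", 65L, 65)
       ).toDF("id", "timestamp", "seq", "items")
   
       // Per record key, keep only the row with the highest precombine value.
       val w = Window.partitionBy($"id").orderBy($"seq".desc)
       updates
         .withColumn("rn", row_number().over(w))
         .where($"rn" === 1)
         .drop("rn")
         .show(false) // only the "items" = 65 row should survive
   
       spark.stop()
     }
   }
   ```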
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Follow 
https://cwiki.apache.org/confluence/display/HUDI/2020/01/20/Change+Capture+Using+AWS+Database+Migration+Service+and+Hudi
   2. Add a SQL transform that extracts a unique number from the input file into its own column (this number already exists in a column of the dataset and is unique; the transform only copies it into a dedicated column), as sketched below.
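   
   For step 2, the transform is wired up roughly like this (a sketch, assuming the `SqlQueryBasedTransformer` shipped with hudi-utilities; `seq_col` is a hypothetical stand-in for the column holding the unique number):
   
   ```properties
   # Hypothetical DeltaStreamer transformer sketch. <SRC> is the placeholder
   # DeltaStreamer substitutes with the incoming batch; seq_col stands in
   # for the real column name.
   hoodie.deltastreamer.transformer.sql=SELECT a.*, CAST(a.seq_col AS BIGINT) AS dedupe_seq FROM <SRC> a
   hoodie.datasource.write.precombine.field=dedupe_seq
   ```
   
   The job is then launched with `--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer` and `--source-ordering-field dedupe_seq`.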
   
   
   **Expected behavior**
   
   Record 5 from the example above should end up as the stored value for that record key. I expected DeltaStreamer to order records that share the same record key and timestamp by the precombine field. Instead, DeltaStreamer keeps the first record for that timestamp and record key and ignores later records with a higher precombine field value.
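   
   In other words, given two updates for the same record key, I expected the merge to behave like the following comparison (a sketch of the semantics I expected, not Hudi's actual implementation):
   
   ```scala
   // Sketch of the expected tie-breaking (not Hudi's actual code): order by
   // timestamp first, then fall back to the precombine value ("seq") when
   // timestamps are equal.
   case class Update(timestamp: String, seq: Long, items: Int)
   
   def preCombine(current: Update, incoming: Update): Update = {
     val byTimestamp = current.timestamp.compareTo(incoming.timestamp)
     if (byTimestamp > 0) current
     else if (byTimestamp < 0) incoming
     else if (current.seq >= incoming.seq) current // tie: higher precombine wins
     else incoming
   }
   ```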
   
   **Environment Description**
   
   * Platform : EMR
   * Hudi version : 0.5.2 and 0.6.0
   * Spark version : 2.4.5
   * Hive version : x
   * Hadoop version : x
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : no
   
   
   

