joaqs190 opened a new issue #1803:
URL: https://github.com/apache/hudi/issues/1803


   
   **Describe the problem you faced**
   
   Hi Hudi team! 
   
   My use case relies on `hoodie.datasource.write.precombine.field`. The records have a composite key, and there are often multiple records with the same key and the same timestamp; the precombine field is meant to break those ties.
   
   In tests with 0.5.2 and 0.6.0 the precombine field is not taken into account, and the record ends up holding an intermediate update rather than the latest one. See the example below.
   
   Example:
   
   Output of the records in S3 generated by AWS DMS:
   
   ```
   Record 1:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:47.000000",
   "items": 61
   
   Record 2:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:48.000000",
   "items": 62
   
   Record 3:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:52.000000",
   "items": 63
   
   Record 4:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:52.000000",
   "items": 64
   
   Record 5:
   "Op": "U",
   "timestamp": "2020-07-06 18:57:52.000000",
   "items": 65
   ```
   
   Inspecting the Hudi DeltaStreamer output from within Spark, Record 3 ("items" = 63) was written to the dataset, but not Record 5 ("items" = 65).
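   
   For reference, this is the deduplication I expected, expressed as a plain Spark job (a minimal sketch; the column names `id` and `seq` are hypothetical stand-ins for my record key and precombine columns):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.row_number
   
   object PrecombineExpectation {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("precombine-expectation")
         .master("local[*]")
         .getOrCreate()
       import spark.implicits._
   
       // Records 3-5 from the example: same key, same timestamp,
       // increasing precombine value ("seq", the unique number).
       val updates = Seq(
         ("key-1", "2020-07-06 18:57:52.000000", 63L, 63),
         ("key-1", "2020-07-06 18:57:52.000000", 64L, 64),
         ("key-1", "2020-07-06 18:57:52.000000", 65L, 65)
       ).toDF("id", "timestamp", "seq", "items")
   
       // Per record key, keep only the row with the highest precombine value.
       val w = Window.partitionBy($"id").orderBy($"seq".desc)
       updates
         .withColumn("rn", row_number().over(w))
         .where($"rn" === 1)
         .drop("rn")
         .show(false) // only the "items" = 65 row should survive
   
       spark.stop()
     }
   }
   ```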
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Follow 
https://cwiki.apache.org/confluence/display/HUDI/2020/01/20/Change+Capture+Using+AWS+Database+Migration+Service+and+Hudi
   2. Add a SQL transform that extracts a unique number from the input file into its own column (this number already exists in a column of the dataset and is unique; the transform only copies it into a dedicated column), as sketched below.
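   
   For step 2, the transform is wired up roughly like this (a sketch, assuming the `SqlQueryBasedTransformer` shipped with hudi-utilities; `seq_col` is a hypothetical stand-in for the column holding the unique number):
   
   ```properties
   # Hypothetical DeltaStreamer transformer sketch. <SRC> is the placeholder
   # DeltaStreamer substitutes with the incoming batch; seq_col stands in
   # for the real column name.
   hoodie.deltastreamer.transformer.sql=SELECT a.*, CAST(a.seq_col AS BIGINT) AS dedupe_seq FROM <SRC> a
   hoodie.datasource.write.precombine.field=dedupe_seq
   ```
   
   The job is then launched with `--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer` and `--source-ordering-field dedupe_seq`.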
   
   
   **Expected behavior**
   
   Record 5 from the example above should end up as the stored value for that record key. I expected DeltaStreamer to order records that share the same record key and timestamp by the precombine field. Instead, DeltaStreamer keeps the first record for that timestamp and record key and ignores later records with a higher precombine field value.
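   
   In other words, given two updates for the same record key, I expected the merge to behave like the following comparison (a sketch of the semantics I expected, not Hudi's actual implementation):
   
   ```scala
   // Sketch of the expected tie-breaking (not Hudi's actual code): order by
   // timestamp first, then fall back to the precombine value ("seq") when
   // timestamps are equal.
   case class Update(timestamp: String, seq: Long, items: Int)
   
   def preCombine(current: Update, incoming: Update): Update = {
     val byTimestamp = current.timestamp.compareTo(incoming.timestamp)
     if (byTimestamp > 0) current
     else if (byTimestamp < 0) incoming
     else if (current.seq >= incoming.seq) current // tie: higher precombine wins
     else incoming
   }
   ```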
   
   **Environment Description**
   
   * Platform : EMR
   * Hudi version : 0.5.2 and 0.6.0
   * Spark version : 2.4.5
   * Hive version : x
   * Hadoop version : x
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : no
   
   
   

