[
https://issues.apache.org/jira/browse/HUDI-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-2774:
--------------------------------------
Description:
Setup:
Started deltastreamer with a parquet DFS source. The source folder was empty at
startup. Enabled async clustering with the props below:
```
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.sort.columns=type,id
```
Added 1 file to the source folder, and deltastreamer failed while processing it.
The commit went through fine and the 1st replace commit got scheduled, but the
timeline shows a 2nd one as well.
The clustering plan seems to be the same in both requested meta files:
{code:java}
grep "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000" /tmp/hudi-deltastreamer-gh-mw/.hoodie/* | grep replacecommit
grep: /tmp/hudi-deltastreamer-gh-mw/.hoodie/archived: Is a directory
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123724586.replacecommit.requested matches
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123725199.replacecommit.requested matches
{code}
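For anyone who wants to repeat this check without the "Is a directory" noise, here is a rough Python sketch of the same search. It assumes that timeline meta files sit directly under `.hoodie` and are named `<instant>.<action>[.<state>]`, as the file names in the grep output suggest. The function name `find_matching_instants` and its parameters are made up for illustration and are not part of Hudi.

```python
from pathlib import Path


def find_matching_instants(timeline_dir, needle, action="replacecommit"):
    """Return sorted names of timeline meta files for `action` that contain `needle`.

    Hypothetical helper, not a Hudi API. Assumes file names of the form
    <instant>.<action>[.<state>] directly under the timeline directory.
    """
    needle_bytes = needle.encode("utf-8")
    matches = []
    for path in Path(timeline_dir).iterdir():
        # Skip sub-directories such as `archived`, which plain grep complains about.
        if not path.is_file():
            continue
        # Keep only meta files whose second name component is the requested action.
        parts = path.name.split(".")
        if len(parts) < 2 or parts[1] != action:
            continue
        # The requested plan is a binary file, so search raw bytes, not decoded text.
        if needle_bytes in path.read_bytes():
            matches.append(path.name)
    return sorted(matches)
```

Calling `find_matching_instants("/tmp/hudi-deltastreamer-gh-mw/.hoodie", "<file id>")` with the file id from the grep command should list the same replacecommit meta files that grep reports. Like the grep, it only shows that both plans mention the same file id, not that the plans are identical.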
timeline
!Screen Shot 2021-11-16 at 12.42.20 PM.png!
was:
Setup:
Started deltastreamer with a parquet DFS source. The source folder was empty at
startup. Enabled async clustering with the props below:
```
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.sort.columns=type,id
```
Added 1 file to the source folder, and deltastreamer failed while processing it.
The commit went through fine and the 1st replace commit got scheduled. My hunch
is that in the subsequent round of deltastreamer sync, it tries to schedule
again and fails.
The clustering plan seems to be the same in both requested meta files:
{code:java}
grep "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000" /tmp/hudi-deltastreamer-gh-mw/.hoodie/* | grep replacecommit
grep: /tmp/hudi-deltastreamer-gh-mw/.hoodie/archived: Is a directory
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123724586.replacecommit.requested matches
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123725199.replacecommit.requested matches
{code}
timeline
!Screen Shot 2021-11-16 at 12.42.20 PM.png!
> Async Clustering via deltastreamer fails with IllegalStateException: Duplicate
> key [==>20211116123724586__replacecommit__INFLIGHT]
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-2774
> URL: https://issues.apache.org/jira/browse/HUDI-2774
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: sivabalan narayanan
> Assignee: Sagar Sumit
> Priority: Blocker
> Fix For: 0.10.0
>
> Attachments: Screen Shot 2021-11-16 at 12.42.20 PM.png
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)