[ 
https://issues.apache.org/jira/browse/HUDI-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2774:
--------------------------------------
    Description: 
Setup:

Started deltastreamer with a parquet DFS source. The source folder was 
initially empty. Enabled async clustering with the props below:

```
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.sort.columns=type,id
```

Added 1 file to the source folder, and deltastreamer failed while processing 
it. The commit went through fine and the 1st replace commit got scheduled, but 
the timeline shows a 2nd one as well.

 

The clustering plan seems to be the same in both requested meta files:
{code:java}
grep "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000" \
  /tmp/hudi-deltastreamer-gh-mw/.hoodie/* | grep replacecommit
grep: /tmp/hudi-deltastreamer-gh-mw/.hoodie/archived: Is a directory
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123724586.replacecommit.requested matches
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123725199.replacecommit.requested matches
{code}
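
The same check can be scripted so that it never touches the archived directory and prints only file names. This is a minimal sketch, not part of Hudi: the function name plans_for_file_id is hypothetical, and it assumes that pending clustering plans are the files ending in .replacecommit.requested directly under the .hoodie folder, as in the output above.

```shell
# Hypothetical helper (not a Hudi tool): print the requested replacecommit
# files that mention a given file ID.
# Usage: plans_for_file_id <timeline_dir> <file_id>
plans_for_file_id() {
    timeline_dir=$1
    file_id=$2
    # -l prints only the names of matching files, including binary ones.
    # -F treats the file ID as a fixed string rather than a regex.
    # The glob selects only requested replacecommit files, so the
    # "archived" directory is never handed to grep.
    grep -lF -- "$file_id" "$timeline_dir"/*.replacecommit.requested 2>/dev/null
}

# Example call, using the path and file ID from this report:
# plans_for_file_id /tmp/hudi-deltastreamer-gh-mw/.hoodie \
#     "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000"
```

If more than one file name is printed, the same file ID is referenced by more than one pending plan.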
 

 

Timeline:

!Screen Shot 2021-11-16 at 12.42.20 PM.png!
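
For a text view of the timeline, a sketch like the following lists replacecommit instants that were requested but never completed. The function name pending_replacecommits is hypothetical, and the sketch assumes that a completed instant is marked by a file named <instant>.replacecommit next to the .requested file; verify that against the Hudi version in use.

```shell
# Hypothetical helper (not a Hudi tool): print the instant times of
# replacecommits that have a ".replacecommit.requested" file but no
# completed "<instant>.replacecommit" file in the same directory.
# Usage: pending_replacecommits <timeline_dir>
pending_replacecommits() {
    timeline_dir=$1
    for requested in "$timeline_dir"/*.replacecommit.requested; do
        # If the glob matched nothing, the pattern is left unexpanded; skip it.
        [ -e "$requested" ] || continue
        instant=$(basename "$requested" .replacecommit.requested)
        if [ ! -e "$timeline_dir/$instant.replacecommit" ]; then
            echo "$instant"
        fi
    done
}

# Example call, using the path from this report:
# pending_replacecommits /tmp/hudi-deltastreamer-gh-mw/.hoodie
```

Two pending instants printed only milliseconds apart would match the situation described here.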

 

 

  was:
Setup:

Started deltastreamer with a parquet DFS source. The source folder was 
initially empty. Enabled async clustering with the props below:

```
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.sort.columns=type,id
```

Added 1 file to the source folder, and deltastreamer failed while processing 
it. The commit went through fine and the 1st replace commit got scheduled. My 
hunch is that in the subsequent round of deltastreamer sync it tries to 
schedule again and fails.

 

The clustering plan seems to be the same in both requested meta files:
{code:java}
grep "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000" \
  /tmp/hudi-deltastreamer-gh-mw/.hoodie/* | grep replacecommit
grep: /tmp/hudi-deltastreamer-gh-mw/.hoodie/archived: Is a directory
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123724586.replacecommit.requested matches
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123725199.replacecommit.requested matches
{code}
 

 

Timeline:

!Screen Shot 2021-11-16 at 12.42.20 PM.png!

 

 


> Async Clustering via deltastreamer fails with IllegalStateException: Duplicate 
> key [==>20211116123724586__replacecommit__INFLIGHT]
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2774
>                 URL: https://issues.apache.org/jira/browse/HUDI-2774
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: sivabalan narayanan
>            Assignee: Sagar Sumit
>            Priority: Blocker
>             Fix For: 0.10.0
>
>         Attachments: Screen Shot 2021-11-16 at 12.42.20 PM.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
