[ 
https://issues.apache.org/jira/browse/HUDI-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430012#comment-17430012
 ] 

sivabalan narayanan edited comment on HUDI-2559 at 10/18/21, 1:24 PM:
----------------------------------------------------------------------

Here are the possible solutions:
 # Add millisecond-level granularity to the commit timestamp. 
[https://github.com/apache/hudi/pull/2701]
 # Add a per-writer config named writerUniqueId, which the user is expected 
to set to a unique string for every writer. Hudi does not depend on the actual 
timestamp format and generally orders commit timestamps by string comparison, 
so this should also work. For instance, as of today, commit timestamps look 
like this:

20211015191547

If we add a unique writer ID as a suffix to this, we get

20211015191547-writer1

 

So even if two writers happen to start a new write concurrently and generate 
the same timestamp, the commit times will be as follows:

20211015191547-writer1

20211015191547-writer2
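
The suffix idea can be sketched as follows. This is only an illustration under stated assumptions: WriterSuffixSketch and both of its methods are hypothetical names and are not part of Hudi, and the sketch only shows that suffixed instant times stay distinct and still sort by timestamp first under plain string comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hudi: illustrates the suffix idea only.
final class WriterSuffixSketch {

  // Appends the (assumed) per-writer id to a second-granularity commit timestamp.
  static String withWriterId(String commitTime, String writerUniqueId) {
    return commitTime + "-" + writerUniqueId;
  }

  // Orders instant times with plain String comparison, with no timestamp parsing.
  static List<String> sortLexicographically(List<String> instantTimes) {
    List<String> sorted = new ArrayList<>(instantTimes);
    Collections.sort(sorted);
    return sorted;
  }

  public static void main(String[] args) {
    String sameTimestamp = "20211015191547";
    String first = withWriterId(sameTimestamp, "writer1");
    String second = withWriterId(sameTimestamp, "writer2");

    // Same timestamp, different writers: the instant times no longer collide.
    System.out.println(first.equals(second));

    // A later timestamp still sorts after both, regardless of its suffix.
    System.out.println(
        sortLexicographically(Arrays.asList(second, "20211015191548-writer1", first)));
  }
}
```

Whether every other code path in Hudi tolerates a non-numeric suffix is not something this sketch checks.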

 

Approach1:

Neat and elegant. It is very unlikely that two writers will generate the same 
timestamp, since the timestamps would need to match at millisecond granularity. 
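
To show why the extra digits help, here is a minimal sketch. The class MillisInstantSketch is hypothetical, and the pattern yyyyMMddHHmmssSSS is an assumption made for illustration; the actual format is whatever the linked pull request defines. The second-granularity pattern is inferred from the 14-digit example above.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch, not Hudi code: compares second and millisecond granularity.
final class MillisInstantSketch {

  // Formats an epoch-millis value in UTC with the given pattern.
  static String format(String pattern, long epochMillis) {
    SimpleDateFormat formatter = new SimpleDateFormat(pattern);
    formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
    return formatter.format(new Date(epochMillis));
  }

  // Second granularity, matching the 14-digit example in this comment.
  static String secondsInstant(long epochMillis) {
    return format("yyyyMMddHHmmss", epochMillis);
  }

  // Millisecond granularity (assumed pattern).
  static String millisInstant(long epochMillis) {
    return format("yyyyMMddHHmmssSSS", epochMillis);
  }

  public static void main(String[] args) {
    // Two writers starting 800 ms apart within the same wall-clock second.
    long writerOneStart = 1634325347100L;
    long writerTwoStart = 1634325347900L;

    // Collides at second granularity.
    System.out.println(secondsInstant(writerOneStart).equals(secondsInstant(writerTwoStart)));
    // Distinct at millisecond granularity.
    System.out.println(millisInstant(writerOneStart).equals(millisInstant(writerTwoStart)));
  }
}
```

Two writers that start within the same millisecond would still collide, which is why this approach makes a clash very unlikely rather than impossible.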

Approach2:

This should also work. If approach1 takes more time to develop or runs into any 
issues, this solution should be straightforward. We can think about releasing 
this as the first version and going with approach1 later if need be. 

> Ensure unique timestamps are generated for commit times with concurrent 
> writers
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-2559
>                 URL: https://issues.apache.org/jira/browse/HUDI-2559
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>
> Ensure unique timestamps are generated for commit times with concurrent 
> writers.
> This is the piece of code in HoodieActiveTimeline that creates a new commit 
> time:
> {code:java}
> public static String createNewInstantTime(long milliseconds) {
>   return lastInstantTime.updateAndGet((oldVal) -> {
>     String newCommitTime;
>     do {
>       newCommitTime = HoodieActiveTimeline.COMMIT_FORMATTER.format(
>           new Date(System.currentTimeMillis() + milliseconds));
>     } while (HoodieTimeline.compareTimestamps(newCommitTime,
>         LESSER_THAN_OR_EQUALS, oldVal));
>     return newCommitTime;
>   });
> }
> {code}
> There is a chance that a deltastreamer and a concurrent Spark datasource 
> writer get the same timestamp, in which case one of them fails. 
> Related GitHub issues and Jira tickets: 
> [https://github.com/apache/hudi/issues/3782]
> https://issues.apache.org/jira/browse/HUDI-2549



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
