[
https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119668#comment-17119668
]
Vinoth Chandar commented on HUDI-944:
-------------------------------------
Hi [~309637554] In fact, I am thinking along these lines as well currently. We
potentially need this for the RFC-15/RFC-08 work too (since compaction and
ingestion will be concurrently writing to the metadata hoodie table; cc
[~pwason])
Tying back to the clustering discussion, I actually think that can happen
without requiring concurrent writing.
I would like to break this effort up into two steps, if you agree:
*a) Support parallel writing*
(i.e., users are responsible for ensuring there are no conflicts when writing,
e.g. by touching different partitions).
This is a very practical solution for common cases where you are writing to
the latest partitions while doing some cleanup/re-writing on older
partitions. To make this happen, we need the following changes (and need to
think about some surrounding context as well):
- Today, before we begin a commit/delta commit, we roll back the last inflight
commit, assuming it is the failed write from the last run. We should change the
code to not do this.
- Instead, we will have to rely on marker-file based rollback (HUDI-839 has
a PR up already; it needs more tests) and defer this rollback to
cleaning/archival.
This is groundwork that we need for (b) anyway.
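To make the contract in (a) concrete, here is a minimal sketch of the partition-disjointness rule it relies on: two writers may proceed in parallel only if the partition sets they touch do not overlap. The class and method names below are hypothetical illustrations, not Hudi APIs.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of the user-enforced guard behind option (a):
 * parallel writers are safe only when their partition sets are disjoint.
 */
public class PartitionGuard {

    /** Returns true when the two writers touch disjoint partition sets. */
    public static boolean canRunInParallel(Set<String> writerA, Set<String> writerB) {
        Set<String> overlap = new HashSet<>(writerA);
        overlap.retainAll(writerB); // keep only partitions both writers touch
        return overlap.isEmpty();
    }

    public static void main(String[] args) {
        // Ingestion writes the latest partition; a backfill rewrites older ones.
        Set<String> ingest = Set.of("2020/05/25");
        Set<String> backfill = Set.of("2020/04/01", "2020/04/02");
        System.out.println(canRunInParallel(ingest, backfill)); // prints: true
    }
}
```

Under this model Hudi itself does no conflict detection; the check above is the invariant the user must uphold across their jobs.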
*b) Support concurrent writing*
Handling concurrent updates is actually fairly straightforward given Hudi's
design, but inserts need a lot more thought. Given that Hudi enforces a unique
key constraint, we need to think differently to handle inserts: if two
transactions insert the same records, only one of them should succeed. What
other systems out there do is implement file-level fencing and pretend
inserts don't conflict with each other. But if you have used an RDBMS, you will
hopefully agree that this is not correct.
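To illustrate why file-level fencing falls short: two transactions can insert the same record key into *different* files, so no file-level conflict is ever detected, yet the unique key constraint is violated. A commit-time key-overlap check (a hypothetical sketch, not a Hudi API) would catch this case:

```java
import java.util.Set;

/**
 * Hypothetical sketch: detect a unique-key violation that file-level
 * fencing misses, by comparing record keys rather than touched files.
 */
public class InsertConflictSketch {

    /**
     * Returns true if any key we are inserting was already inserted by a
     * transaction that committed while ours was in flight.
     */
    public static boolean conflictsWith(Set<String> committedKeys, Set<String> ourKeys) {
        for (String key : ourKeys) {
            if (committedKeys.contains(key)) {
                return true; // same key inserted twice: violates uniqueness
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> txn1 = Set.of("uuid-1", "uuid-2"); // already committed
        Set<String> txn2 = Set.of("uuid-2", "uuid-3"); // wrote different files
        // No files overlap, so fencing sees no conflict, but the keys do:
        System.out.println(conflictsWith(txn1, txn2)); // prints: true
    }
}
```

The hard part, of course, is doing this check efficiently at table scale rather than with in-memory sets, which is part of what (b) needs to solve.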
If we are to eventually use Hudi to perform stateful streaming computations
with Flink, we need to solve the general problem. I have some rough ideas and
have been having conversations with some very senior people in academia as
well. Happy to share more as we go.
May I suggest that we get started with (a) for now? If you could take that up
(we can make a sub-task here) and familiarize yourself with the code through a
well-scoped problem like (a), we can keep discussing (b) and arrive at a
solution. From what I have heard from users, (a) itself is very practical and
has a lot of mileage. Please let me know what your thoughts are :)
> Support more complete concurrency control when writing data
> ------------------------------------------------------------
>
> Key: HUDI-944
> URL: https://issues.apache.org/jira/browse/HUDI-944
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
>
> Currently, Hudi only supports concurrency control between writing and
> compaction. But some scenarios need concurrency control between writers,
> e.g. two Spark jobs with different data sources that need to write to the
> same Hudi table.
> I have a two-step proposal:
> 1. First step: support write concurrency control across different
> partitions. Currently, when two clients write data to different partitions,
> they hit these errors:
> a. Rolling back commits failed
> b. instant version already exists
> {code:java}
> [2020-05-25 21:20:34,732] INFO Checking for file exists
> ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight
> (org.apache.hudi.common.table.timeline.HoodieActiveTimeline)
> Exception in thread "main" org.apache.hudi.exception.HoodieIOException:
> Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean
> at
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437)
> at
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327)
> at
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290)
> at
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183)
> at
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142)
> at
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
> at
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> {code}
> c. the two clients' archiving conflicts
> d. the reading client hits "Unable to infer schema for Parquet. It must be
> specified manually."
> 2. Second step: support insert/upsert/compaction concurrency control at
> different isolation levels, such as Serializable and WriteSerializable.
> Hudi can design a mechanism to check for conflicts in
> AbstractHoodieWriteClient.commit().
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)