[ https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127395#comment-17127395 ]

liwei commented on HUDI-944:
----------------------------

Thanks so much [~vinoth]

I strongly agree with you.

First, I also think (a) is very practical for many scenarios; the only 
disadvantage is that it may generate a large number of partitions. I am very 
happy to take up (a). Can I start by adding more tests to HUDI-839? :)

Second, I think (b) is valuable for Hudi. Today users write large amounts of 
data with the Hudi client on Spark, relying on Spark's distributed execution. 
But that approach requires a Spark cluster, which is too heavy in some 
scenarios. Some users need a lightweight solution that simply writes 
concurrently through the client.

Third, I have some rough ideas about (b). Regarding “inserts, i.e. two 
transactions inserting the same records, only one of them should succeed”: we 
also run into this scenario. Some databases use bucketing or sharding to solve 
this problem. With bucketing, users first bucket their data by key using a 
hash-partitioning algorithm (Kafka has such an algorithm built in); then 
different Hudi clients write data with different keys and do not conflict when 
writing concurrently. The shortcoming is that it relies on users bucketing the 
data themselves before writing to Hudi. But I still think this solution may 
make sense, because Hudi is only a storage format today and has no service 
that hashes the incoming data before concurrent writes. Is 
https://issues.apache.org/jira/browse/HUDI-55 relevant? :)
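To make the bucketing idea concrete, here is a minimal hypothetical sketch (the class and constants are my own, not a Hudi API): each record key is hashed into a fixed number of buckets, Kafka-partitioner style, and each concurrent writer only keeps the keys in its own bucket, so two writers can never insert the same record.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: hash-partition record keys into buckets (similar
// to Kafka's default partitioner) so that each writer process handles a
// disjoint set of keys and concurrent writers never touch the same record.
public class KeyBucketing {

  // Assumed: number of buckets equals the number of concurrent writer clients.
  static final int NUM_BUCKETS = 4;

  // Non-negative hash-based bucket id for a record key.
  static int bucketFor(String recordKey) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
  }

  // Keep only the keys this writer is responsible for.
  static List<String> keysForWriter(Stream<String> keys, int writerId) {
    return keys.filter(k -> bucketFor(k) == writerId)
               .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> keys = List.of("uuid-1", "uuid-2", "uuid-3", "uuid-4");
    for (int w = 0; w < NUM_BUCKETS; w++) {
      System.out.println("writer " + w + " -> " + keysForWriter(keys.stream(), w));
    }
  }
}
```

Since every key lands in exactly one bucket, the buckets partition the key space and no coordination between writers is needed, at the cost of users doing the bucketing themselves before writing to Hudi.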

 

thank you very much,

liwei

> Support more complete concurrency control when writing data
> ------------------------------------------------------------
>
>                 Key: HUDI-944
>                 URL: https://issues.apache.org/jira/browse/HUDI-944
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: liwei
>            Assignee: liwei
>            Priority: Major
>             Fix For: 0.6.0
>
>
> Hudi currently only supports concurrency control between writing and 
> compaction, but some scenarios also need concurrency control between 
> writers, e.g. two Spark jobs with different data sources writing to the 
> same Hudi table.
> I have a two-step proposal:
> 1. First step: support write concurrency control across different partitions.
>  But today, when two clients write data to different partitions, they hit 
> these errors:
> a. Rolling back commits failed
> b. instant version already exists
> {code:java}
>  [2020-05-25 21:20:34,732] INFO Checking for file exists ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight (org.apache.hudi.common.table.timeline.HoodieActiveTimeline)
>  Exception in thread "main" org.apache.hudi.exception.HoodieIOException: Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean
>  at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437)
>  at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327)
>  at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290)
>  at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183)
>  at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142)
>  at org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  {code}
> c. the two clients' archiving conflicts
> d. the reading client hits "Unable to infer schema for Parquet. It must be 
> specified manually.;"
> 2. Second step: support insert, upsert, and compaction concurrency control 
> under different isolation levels such as Serializable and WriteSerializable.
> Hudi could design a mechanism to check for conflicts in 
> AbstractHoodieWriteClient.commit()
>  
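The commit-time conflict check suggested in the quoted description could look roughly like the following hypothetical sketch (class and method names are mine, not the actual AbstractHoodieWriteClient API): under a write-serializable level, a pending commit is rejected if the set of file groups it touched overlaps with a commit that completed after it began.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a commit-time conflict check (not a Hudi API):
// two concurrent transactions conflict when they touched at least one
// common file group; disjoint partitions therefore never conflict.
public class CommitConflictCheck {

  // Returns true when the two commits touched a common file group.
  static boolean conflicts(Set<String> committedFileGroups, Set<String> pendingFileGroups) {
    Set<String> overlap = new HashSet<>(committedFileGroups);
    overlap.retainAll(pendingFileGroups);
    return !overlap.isEmpty();
  }

  public static void main(String[] args) {
    Set<String> txn1 = Set.of("fg-1", "fg-2"); // finished first
    Set<String> txn2 = Set.of("fg-2", "fg-3"); // started concurrently
    Set<String> txn3 = Set.of("fg-4");         // writes a disjoint partition
    System.out.println(conflicts(txn1, txn2)); // overlapping: must abort
    System.out.println(conflicts(txn1, txn3)); // disjoint: can commit
  }
}
```

Under this scheme the first-step goal (writers on different partitions) succeeds automatically, because disjoint partitions imply disjoint file groups.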



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
