[ 
https://issues.apache.org/jira/browse/FLINK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lsw_aka_laplace updated FLINK-19706:
------------------------------------
    Summary: Add WARN logs when hive table partition has existed before commit 
in `MetastoreCommitPolicy`     (was: Introduce `Repeated Partition Commit 
Check` in `org.apache.flink.table.filesystem.PartitionCommitPolicy` )

> Add WARN logs when hive table partition has existed before commit in 
> `MetastoreCommitPolicy`   
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19706
>                 URL: https://issues.apache.org/jira/browse/FLINK-19706
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, Connectors / Hive, Table SQL / 
> Runtime
>            Reporter: Lsw_aka_laplace
>            Assignee: Lsw_aka_laplace
>            Priority: Minor
>         Attachments: image-2020-10-19-16-47-39-354.png, 
> image-2020-10-19-16-57-02-661.png, image-2020-10-19-17-00-27-255.png, 
> image-2020-10-19-17-03-21-558.png, image-2020-10-19-18-16-35-083.png
>
>
> Hi all,
>       Recently we have been using Hive streaming writes to speed up the 
> data sync of our Hive-based data warehouse, and we eventually got it 
> working. 
>        For production use, we added many metrics, logs and checks to help 
> us analyze runtime behavior and fix unexpected problems. Among these, we 
> found that checking for repeated partition commits is the most important 
> one, so we would like to contribute it back to the community.
>      If this proposal makes sense, I am happy to present my design and 
> implementation.
>  
> Looking forward to ANY opinion~
>  
>  
> ---- UPDATE ----
>  
> A user of ours (who uses our own platform to build his own Flink job) raised 
> some requests. One of them is that once a partition is committed, the data 
> in that partition is regarded as frozen, i.e. complete. Committing a 
> partition acts as a kind of guarantee (though we all know it is hard to 
> make it a firm promise) that the partition is complete, and we have put many 
> measures in place to make [partition commit means complete] hold. So if a 
> partition is committed twice or more, then for us something must have gone 
> wrong or our measures are insufficient. It also tells us that we need to 
> take remedial action to avoid data loss or incomplete data.
>  
> So first of all, it is important for us to know that a certain partition 
> has been committed repeatedly, so that we can do the following ASAP:
>    1. analyze the cause 
>    2. do some trade-off operations
>    3. improve our code/measures
>  
> ---- Design and Implementation ----
> There are basically two approaches; both have been used in our production 
> environment.
> Approach1
> Add a check to the CommitPolicy and call it before the partition commit
> !image-2020-10-19-16-47-39-354.png|width=576,height=235!
> //{color:#ffab00}Newly posted, see here{color}
> !image-2020-10-19-18-16-35-083.png|width=725,height=313!
>  1.1 As the picture shows, add `checkPartitionExists` and implement it in 
> the sub-class
>   !image-2020-10-19-17-03-21-558.png|width=1203,height=88!
>  1.2 Call `checkPartitionExists` before the partition commit
> --- 
> Approach2
> Build a bounded cache of committed partitions and check it every time 
> before a partition commit 
> (this cache is actually supposed to be operator state)
> !image-2020-10-19-16-57-02-661.png|width=1298,height=57!
>   2.1 build a cache
> !image-2020-10-19-17-00-27-255.png|width=1235,height=116!
>   2.2 check before commit 
>  
>  
> ---- UPDATE ----
> After discussing with [~lzljs3620320], `Repeated partition check` seems a 
> little misleading semantically, so only some WARN logs will be added to 
> `MetastoreCommitPolicy` to make repeated commits visible 
>  
>  
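A minimal sketch of Approach1 combined with the WARN-only outcome of the last update, in case it helps the discussion. It does not reproduce Flink's real `MetastoreCommitPolicy`; `PartitionStore`, `InMemoryPartitionStore`, `WarningCommitPolicy` and `repeatedCommits` are hypothetical names made up for illustration, and only `checkPartitionExists` comes from the description above. `java.util.logging` stands in for whatever logger the real class uses, so that the sketch has no dependencies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

/** Hypothetical stand-in for a metastore client; not Flink's or Hive's API. */
interface PartitionStore {
    boolean partitionExists(String partition);

    void addPartition(String partition);
}

/** In-memory store, present only to make the sketch runnable. */
final class InMemoryPartitionStore implements PartitionStore {
    private final Set<String> partitions = new HashSet<>();

    @Override
    public boolean partitionExists(String partition) {
        return partitions.contains(partition);
    }

    @Override
    public void addPartition(String partition) {
        partitions.add(partition);
    }
}

/** Hypothetical commit policy that warns, but never fails, on a repeated commit. */
final class WarningCommitPolicy {
    private static final Logger LOG =
            Logger.getLogger(WarningCommitPolicy.class.getName());

    private final PartitionStore store;
    // Kept only so callers (and tests) can observe which commits were repeated.
    private final List<String> repeated = new ArrayList<>();

    WarningCommitPolicy(PartitionStore store) {
        this.store = store;
    }

    /** Step 1.1: ask the store whether the partition is already there. */
    boolean checkPartitionExists(String partition) {
        return store.partitionExists(partition);
    }

    /** Step 1.2: run the check before committing; the commit itself always proceeds. */
    void commit(String partition) {
        if (checkPartitionExists(partition)) {
            LOG.warning("Partition " + partition
                    + " already exists before commit; it is being committed again.");
            repeated.add(partition);
        }
        store.addPartition(partition);
    }

    List<String> repeatedCommits() {
        return repeated;
    }
}
```

The check only logs and records; it never blocks the commit, which matches the decision to add WARN logs rather than a hard `Repeated partition check`.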

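Approach2 from the description could be sketched as below, assuming a plain in-memory cache bounded by entry count with oldest-first eviction. `CommittedPartitionCache` and `markCommitted` are hypothetical names, the eviction rule is an assumption (the description only says "bounded"), and the sketch leaves out the operator-state part entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache of committed partitions; evicts the oldest entry first. */
final class CommittedPartitionCache {
    private final Map<String, Boolean> seen;

    CommittedPartitionCache(final int capacity) {
        // Insertion-ordered map that drops its eldest entry once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /**
     * Records the partition as committed.
     *
     * @return true if the partition was still in the cache, i.e. this is a repeated commit
     */
    boolean markCommitted(String partition) {
        return seen.put(partition, Boolean.TRUE) != null;
    }
}
```

Because the cache is bounded, a repeated commit of a partition that has already been evicted goes unnoticed; that is the trade-off against asking the metastore on every commit as in Approach1.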


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
