[
https://issues.apache.org/jira/browse/FLINK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jingsong Lee updated FLINK-19706:
---------------------------------
Fix Version/s: 1.12.0
> Add WARN logs when hive table partition has existed before commit in
> `MetastoreCommitPolicy`
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-19706
> URL: https://issues.apache.org/jira/browse/FLINK-19706
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem, Connectors / Hive, Table SQL /
> Runtime
> Reporter: Lsw_aka_laplace
> Assignee: Lsw_aka_laplace
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.12.0
>
> Attachments: image-2020-10-19-16-47-39-354.png,
> image-2020-10-19-16-57-02-661.png, image-2020-10-19-17-00-27-255.png,
> image-2020-10-19-17-03-21-558.png, image-2020-10-19-18-16-35-083.png
>
>
> Hi all,
> Recently we have been working on Hive streaming writes to speed up the
> data sync of our Hive-based data warehouse, and we eventually got it
> working.
> For production use, we added many metrics, logs, and measures to help us
> analyze runtime information and fix unexpected problems. Of these, we found
> that checking for repeated partition commits is the most important one, so
> we would like to contribute it back to the community.
> If this proposal makes sense, I am happy to present my design and
> implementation.
>
> Looking forward to ANY opinion~
>
>
> ---- UPDATE ----
>
> One of our users (who builds his own Flink jobs on our platform) raised
> some requests. One of them is that once a partition is committed, the data
> in this partition is regarded as frozen or complete. [Committing a
> partition] works as a kind of guarantee (though we all know it is hard to
> make it a promise) that the partition is complete. Certainly, we take many
> measures to achieve [partition commit means complete]. So if a partition is
> committed twice or more, then for us something must be wrong or our
> measures are insufficient. It also tells us to take remedial action to
> avoid data loss or incomplete data.
>
> So first of all, it is important for us to know when a partition is
> committed repeatedly, so that we can do the following ASAP:
> 1. analyze the cause
> 2. do some trade-off operations
> 3. improve our code/measures
>
> ---- Design and Implementation ----
> There are basically two approaches; both have been used in our production
> environment.
> Approach 1
> Add a check to CommitPolicy that is called before the partition commit
> !image-2020-10-19-16-47-39-354.png|width=576,height=235!
> //{color:#ffab00}Newly posted, see here{color}
> !image-2020-10-19-18-16-35-083.png|width=725,height=313!
> 1.1 As the picture shows, add `checkPartitionExists` and implement it in
> the subclass
> !image-2020-10-19-17-03-21-558.png|width=1203,height=88!
> 1.2 Call `checkPartitionExists` before the partition commit
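The screenshots for Approach 1 are not reproduced in this mail, so here is a minimal sketch of the idea under stated assumptions. `CheckedCommitPolicy`, `InMemoryCommitPolicy`, and `doCommit` are hypothetical names, not Flink's actual classes; only `checkPartitionExists` comes from the description above. A real subclass would presumably ask the Hive metastore instead of an in-memory set.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a partition commit policy; not Flink's actual interface. */
abstract class CheckedCommitPolicy {
    /** Subclasses decide how to look the partition up (e.g. by asking the metastore). */
    protected abstract boolean checkPartitionExists(Map<String, String> partitionSpec);

    /** Subclasses perform the real commit. */
    protected abstract void doCommit(Map<String, String> partitionSpec);

    /** Runs the check first, then commits; returns true if the partition already existed. */
    public final boolean commit(Map<String, String> partitionSpec) {
        boolean existed = checkPartitionExists(partitionSpec);
        doCommit(partitionSpec);
        return existed;
    }
}

/** In-memory subclass standing in for a metastore-backed implementation (assumption). */
class InMemoryCommitPolicy extends CheckedCommitPolicy {
    private final Set<Map<String, String>> committed = new HashSet<>();

    @Override
    protected boolean checkPartitionExists(Map<String, String> partitionSpec) {
        return committed.contains(partitionSpec);
    }

    @Override
    protected void doCommit(Map<String, String> partitionSpec) {
        // Copy the spec so later changes by the caller do not affect the stored entry.
        committed.add(new LinkedHashMap<>(partitionSpec));
    }
}
```

The caller decides what to do with the returned flag, for example emit a metric or a log line; the commit itself still goes through.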
> ---
> Approach 2
> Build a bounded cache of committed partitions and check it every time before
> a partition commit
> (this cache is actually supposed to be operator state)
> !image-2020-10-19-16-57-02-661.png|width=1298,height=57!
> 2.1 Build a cache
> !image-2020-10-19-17-00-27-255.png|width=1235,height=116!
> 2.2 Check before commit
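As a rough illustration of Approach 2, the sketch below keeps a bounded, in-memory cache of committed partition names with least-recently-used eviction. `CommittedPartitionCache` and `markCommitted` are hypothetical names, and the eviction policy is an assumption, since the screenshots do not survive here. The description says the cache should be operator state; this sketch does not cover that and would lose its contents on restart.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache of committed partition names with LRU eviction. */
class CommittedPartitionCache {
    private final Map<String, Boolean> cache;

    CommittedPartitionCache(final int maxSize) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Drop the least recently used entry once the bound is exceeded.
                return size() > maxSize;
            }
        };
    }

    /** Records the partition; returns true if it was already in the cache. */
    boolean markCommitted(String partition) {
        return cache.put(partition, Boolean.TRUE) != null;
    }
}
```

Because the cache is bounded, a repeated commit of a partition that has already been evicted goes undetected, which is the trade-off against asking the metastore every time.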
>
>
> ---- UPDATE ----
> After discussion with [~lzljs3620320], `Repeated partition check` seems a
> little misleading semantically, so only some WARN logs will be added to
> `MetastoreCommitPolicy` to make repeated commits visible
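The agreed behaviour (commit anyway, but warn when the partition already exists) could look roughly like the sketch below. This is not the code from the pull request: `WarnOnRecommit` is a hypothetical class, a `HashSet` stands in for the Hive metastore, and `java.util.logging` is used only to keep the example self-contained, whereas Flink itself logs through SLF4J.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; a HashSet stands in for the Hive metastore. */
class WarnOnRecommit {
    static final Logger LOG = Logger.getLogger(WarnOnRecommit.class.getName());

    private final Set<String> metastorePartitions = new HashSet<>();

    /** Commits the partition either way; logs a warning if it already existed. */
    boolean commit(String partition) {
        // Set.add returns false when the element was already present.
        boolean existed = !metastorePartitions.add(partition);
        if (existed) {
            LOG.log(Level.WARNING,
                    "Partition {0} already exists and is being committed again",
                    partition);
        }
        return existed;
    }
}
```

The commit is never blocked; the warning only makes the repeated commit visible so that its cause can be analyzed afterwards.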
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)