[jira] [Comment Edited] (FLINK-25330) Flink SQL doesn't retract all versions of Hbase data

Jing Ge (Jira) Mon, 20 Dec 2021 14:49:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461651#comment-17461651
 ]


Jing Ge edited comment on FLINK-25330 at 12/20/21, 10:48 PM:
-------------------------------------------------------------

Hi Bruce,

Multi-version support is one of the core design features of HBase. From the 
HBase Delete API, we can see that the default deletion behaviour is to delete 
only the latest version of a column:
{code:java}
addColumn(final byte [] family, final byte [] qualifier) {code}
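To make the difference concrete, here is a minimal sketch of the two deletion semantics. It is a toy in-memory model of a single versioned cell, not the HBase client: the class and method names are made up for illustration. In the real API the two behaviours correspond to Delete.addColumn (latest version only) and Delete.addColumns (all versions).

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy in-memory model of one versioned HBase cell. NOT the HBase client API;
// all names here are hypothetical and only illustrate the semantics.
public class VersionedCellSketch {
    // timestamp -> value, ordered so that the newest timestamp comes first
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.reverseOrder());

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Same effect as Delete.addColumn: only the newest version is removed.
    public void deleteLatestVersion() {
        versions.pollFirstEntry();
    }

    // Same effect as Delete.addColumns: every version is removed.
    public void deleteAllVersions() {
        versions.clear();
    }

    // What a plain read returns: the newest remaining value, or null.
    public String read() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch();
        cell.put(1L, "v1");
        cell.put(2L, "v2");

        cell.deleteLatestVersion();
        System.out.println(cell.read()); // prints "v1": the older version is visible again

        cell.deleteAllVersions();
        System.out.println(cell.read()); // prints "null": nothing is left
    }
}
```

The first delete is what the issue describes: the old version becomes visible again after the latest one is removed.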
In the Flink case, HBase is not only used for CDC. It is used with Flink in 
many different big data processing scenarios, such as user behaviour 
analytics and churn analytics, and could in fact be used in every phase of the 
AARRR model, including triggering persona-based promotion operations, where 
historical versions of the users' tracking data are consumed. Because the data 
is so important, physical deletions are generally converted to logical 
deletions. We could also look at it from a different direction: if all versions 
should always be deleted for any delete request, why would HBase support 
multiple versions and provide this API in the first place? It would consume 
more resources and provide no extra value.

Back to your scenario: since you want to delete all versions, it looks like you 
only need one version for the column. Would it solve your problem if you let 
the column store only 1 version? WDYT?
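As a rough sketch of why a single-version column would help: if only one version is ever kept, deleting the latest version leaves nothing behind. The class below is again a hypothetical toy model, not the HBase client. It drops surplus versions eagerly on every write, which is a simplifying assumption: as far as I understand, HBase applies the version limit of a column family at read and compaction time, so the real behaviour may differ until a compaction has run.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy model of a column that keeps at most maxVersions versions.
// NOT the HBase client API; names are hypothetical.
public class SingleVersionSketch {
    private final int maxVersions;
    // timestamp -> value, newest timestamp first
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.reverseOrder());

    public SingleVersionSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
        // Assumption: surplus versions are dropped eagerly, oldest first.
        while (versions.size() > maxVersions) {
            versions.pollLastEntry();
        }
    }

    // Removes only the newest version, like the default delete behaviour.
    public void deleteLatestVersion() {
        versions.pollFirstEntry();
    }

    // Newest remaining value, or null if the column is empty.
    public String read() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        SingleVersionSketch single = new SingleVersionSketch(1);
        single.put(1L, "v1");
        single.put(2L, "v2"); // v1 is dropped because only 1 version is kept
        single.deleteLatestVersion();
        System.out.println(single.read()); // prints "null"

        SingleVersionSketch multi = new SingleVersionSketch(3);
        multi.put(1L, "v1");
        multi.put(2L, "v2");
        multi.deleteLatestVersion();
        System.out.println(multi.read()); // prints "v1"
    }
}
```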



> Flink SQL doesn't retract all versions of Hbase data
> ----------------------------------------------------
>
>                 Key: FLINK-25330
>                 URL: https://issues.apache.org/jira/browse/FLINK-25330
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / HBase
>            Reporter: Bruce Wong
>            Assignee: Jing Ge
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2021-12-15-20-05-18-236.png
>
>
> h2. Background
> When we use CDC to synchronize MySQL data to HBase, we find that HBase 
> deletes only the last version of the specified rowkey when deleting MySQL 
> data. The data of the old version still exists, so you end up using the wrong 
> data. I think it's a bug in the HBase connector.
> The following figure shows HBase data changes before and after MySQL data is 
> deleted.
> !image-2021-12-15-20-05-18-236.png|width=910,height=669!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
