[
https://issues.apache.org/jira/browse/FLINK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461651#comment-17461651
]
Jing Ge edited comment on FLINK-25330 at 12/20/21, 10:48 PM:
-------------------------------------------------------------
Hi Bruce,
multi-version support is one of the core design features of HBase. From the
HBase Delete API, we can see that the default deletion behaviour is to delete
only the latest version.
{code:java}
// Delete#addColumn: deletes only the latest version of the given column
addColumn(final byte[] family, final byte[] qualifier)
{code}
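To make the difference concrete, here is a minimal sketch in plain Java. It is a toy model, not the HBase client: the class VersionedCell and its methods are hypothetical names used only for illustration. As far as I understand the HBase Delete API, deleteLatestVersion() corresponds to Delete#addColumn(family, qualifier) and deleteAllVersions() corresponds to Delete#addColumns(family, qualifier).

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Toy model of one multi-version cell; NOT the HBase client API (hypothetical class). */
public class VersionedCell {
    // timestamp -> value, sorted ascending, so the last entry is the newest version
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Value a default read would return: the newest remaining version, if any. */
    public Optional<String> get() {
        Map.Entry<Long, String> newest = versions.lastEntry();
        return newest == null ? Optional.empty() : Optional.of(newest.getValue());
    }

    /** Models Delete#addColumn(family, qualifier): removes only the newest version. */
    public void deleteLatestVersion() {
        versions.pollLastEntry();
    }

    /** Models Delete#addColumns(family, qualifier): removes every version. */
    public void deleteAllVersions() {
        versions.clear();
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell();
        cell.put(1L, "old");
        cell.put(2L, "new");

        cell.deleteLatestVersion();
        System.out.println(cell.get()); // the older version is visible again

        cell.deleteAllVersions();
        System.out.println(cell.get()); // no version is left
    }
}
```

In this model, deleting only the latest version exposes the previous one, which matches the behaviour described in the issue, while deleting all versions leaves nothing to read.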
As for Flink, HBase is not only used for CDC. It is used with Flink in many
different big data processing scenarios, such as user behaviour analytics and
churn analytics, and could be used in every phase of the AARRR model, including
triggering persona-based promotion operations, where historical versions of the
users' tracking data are consumed. Because the data is so important, physical
deletions are generally converted to logical deletions. We could also look at
it from a different direction: if every delete request should always delete all
versions, why would HBase support multiple versions and provide this API in the
first place? It would consume more resources and provide no extra value.
Back to your scenario: since you want to delete all versions, it looks like you
only need one version of the column. Would it solve your problem if you let
the column store only one version? WDYT?
> Flink SQL doesn't retract all versions of Hbase data
> ----------------------------------------------------
>
> Key: FLINK-25330
> URL: https://issues.apache.org/jira/browse/FLINK-25330
> Project: Flink
> Issue Type: Bug
> Components: Connectors / HBase
> Reporter: Bruce Wong
> Assignee: Jing Ge
> Priority: Critical
> Labels: pull-request-available
> Attachments: image-2021-12-15-20-05-18-236.png
>
>
> h2. Background
> When we use CDC to synchronize MySQL data to HBase, we find that HBase
> deletes only the last version of the specified rowkey when deleting MySQL
> data. The data of the old versions still exists, so you end up using the
> wrong data. I think it's a bug in the HBase connector.
> The following figure shows the HBase data changes before and after the MySQL
> data is deleted.
> !image-2021-12-15-20-05-18-236.png|width=910,height=669!
>
> h2.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)