[
https://issues.apache.org/jira/browse/FLINK-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404838#comment-17404838
]
Carl commented on FLINK-23730:
------------------------------
[~luoyuxia] Thanks for your reply. I have updated the description of the issue.
> Source from hive sink hbase lost data
> -------------------------------------
>
> Key: FLINK-23730
> URL: https://issues.apache.org/jira/browse/FLINK-23730
> Project: Flink
> Issue Type: Bug
> Components: Connectors / HBase, Connectors / Hive
> Affects Versions: 1.12.1
> Reporter: Carl
> Priority: Major
> Attachments: image-2021-08-26-09-43-39-055.png,
> image-2021-08-26-09-44-20-390.png, image-2021-08-26-09-50-35-061.png,
> image-2021-08-26-09-51-38-899.png, image-2021-08-26-09-52-35-808.png,
> image-2021-08-26-10-05-23-289.png
>
>
> Our use case is as follows:
> # hive source: create a Hive table whose metadata is stored in HMS
> # create the HBase table using the HBase shell
> # flink sql ddl: create the HBase Flink table
> # use hive catalog: use Flink SQL to insert into the HBase Flink table
> If I set the table config table.exec.hive.infer-source-parallelism = false,
> the program runs with a parallelism of 1 and the number of result records
> is correct.
> But if I set the table config table.exec.hive.infer-source-parallelism = true,
> the program runs with a parallelism of 20, which means the source parallelism
> is inferred from the number of splits, and the number of result records is
> not correct.
>
> The test was repeated many times and no exception occurred.
>
> So I guess it has something to do with high concurrency. Does it lose data
> because of high concurrency?
>
> update1----------------------
>
> The program is as follows:
>
> *HBase table:*
> !image-2021-08-26-09-50-35-061.png!
>
> *Flink HBase table:*
> !image-2021-08-26-09-51-38-899.png!
>
> *Hive table:*
> !image-2021-08-26-09-52-35-808.png!
>
> I modify these two options to control the parallelism of Flink:
> *table.exec.hive.infer-source-parallelism*
> *table.exec.hive.infer-source-parallelism.max*
> 1. If I set table.exec.hive.infer-source-parallelism=*false*, Flink runs
> with a parallelism of 1, and the result is correct.
> 2. If I set them as follows, Flink runs with a parallelism of 10, and the
> result is correct.
> table.exec.hive.infer-source-parallelism=*true*
> table.exec.hive.infer-source-parallelism.max=*10*
> 3. But if I set them as follows, Flink runs with a parallelism of 20, and the
> result is not correct: the total number of rows in HBase is less than in Hive.
> table.exec.hive.infer-source-parallelism=*true*
> table.exec.hive.infer-source-parallelism.max=*20*
>
> The data of the Hive table is as follows:
>
> !image-2021-08-26-10-05-23-289.png!
>
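To make the three cases in the description concrete, here is a minimal Java sketch of how the two options appear to interact, based only on the runs reported in this issue. It is not Flink's actual implementation: the class name, the method name, and the split count of 20 are hypothetical, and the rule itself (use the default parallelism when inference is off, otherwise cap the split count at the configured maximum) is an assumption drawn from the observed parallelism of 1, 10, and 20.

```java
// Hypothetical model of the behaviour described in this issue.
// This is NOT Flink code; names and the split count are assumptions.
final class InferredParallelism {

    /**
     * Returns the source parallelism under the assumed rule:
     * inference off -> default parallelism,
     * inference on  -> number of splits, capped at inferMax (at least 1).
     */
    public static int sourceParallelism(
            boolean inferSourceParallelism,
            int splitCount,
            int inferMax,
            int defaultParallelism) {
        if (!inferSourceParallelism) {
            return defaultParallelism;
        }
        return Math.max(1, Math.min(splitCount, inferMax));
    }

    public static void main(String[] args) {
        int splits = 20; // assumed: the Hive table yields at least 20 splits

        // Case 1: table.exec.hive.infer-source-parallelism=false
        System.out.println(sourceParallelism(false, splits, 20, 1)); // 1

        // Case 2: infer=true, table.exec.hive.infer-source-parallelism.max=10
        System.out.println(sourceParallelism(true, splits, 10, 1));  // 10

        // Case 3: infer=true, table.exec.hive.infer-source-parallelism.max=20
        System.out.println(sourceParallelism(true, splits, 20, 1));  // 20
    }
}
```

The sketch only models how the parallelism is chosen in each case. It says nothing about why rows go missing at a parallelism of 20.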
--
This message was sent by Atlassian Jira
(v8.3.4#803005)