[ https://issues.apache.org/jira/browse/FLINK-20955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777853#comment-17777853 ]
Ryan Skraba commented on FLINK-20955:
-------------------------------------

Hey – I see there's a PR on apache/flink, with a couple of rounds of reviews! Is anyone working on, or interested in seeing, the PR rebased and migrated to the external apache/flink-connector-hbase repo?

I previously did this for the Pub/Sub connector FLIP-27 update, and I'd be willing to move this one over if a committer can "commit" to taking a look when it's finished! ;)

[~ferenc-csaky], I see you've committed recently to HBase, what do you think?

In the migration, I'd probably recommend splitting the existing PR so that it only includes the FLIP-27 Source, so we can talk about whether to jump directly to the SinkV2 FLIP-191 instead.

> Refactor HBase Source in accordance with FLIP-27
> ------------------------------------------------
>
>                 Key: FLINK-20955
>                 URL: https://issues.apache.org/jira/browse/FLINK-20955
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / HBase, Table SQL / Ecosystem
>            Reporter: Moritz Manner
>            Priority: Not a Priority
>              Labels: auto-deprioritized-major, auto-deprioritized-minor,
>                      auto-unassigned, pull-request-available
>
> The HBase connector source implementation should be updated in accordance
> with [FLIP-27: Refactor Source Interface|https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface].
> One source should map to one table in HBase. Users can specify which
> column[families] to watch; each change in one of the columns triggers a
> record with change type, table, column family, column, value, and timestamp.
> h3. Idea
> The new Flink HBase Source makes use of the internal [replication mechanism
> of HBase|https://hbase.apache.org/book.html#_cluster_replication]. The Source
> registers with the HBase cluster and receives all WAL edits written in
> HBase. From those WAL edits the Source can create the DataStream.
> h3. Split
> We're still not 100% sure which information a Split should contain. We have
> the following possibilities:
> # There is only one Split per Source and the Split contains all the
> necessary information to connect with HBase. The SourceReader which processes
> the Split will receive all WAL edits for all tables and filter the relevant
> edits.
> # There are multiple Splits per Source, each Split representing one HBase
> Region to read from. This assumes that it is possible to only receive WAL
> edits from a specific HBase Region and not receive all WAL edits. This would
> be preferable as it allows parallel processing of multiple regions, but we
> still need to figure out how this is possible.
> In both cases the Split will contain information about the HBase instance and
> table.
> h3. Split Enumerator
> Depending on which Split design we decide on, the split enumerator will
> either connect to HBase and get all relevant regions or just create one Split.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
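
As a rough illustration of the two Split options described in the issue, here is a minimal, dependency-free sketch. Every name in it (HBaseSplitSketch, Split, enumerateSingle, enumeratePerRegion, the quorum and region strings) is hypothetical and not taken from the PR or from Flink. The only thing borrowed from Flink is the idea that a split exposes a splitId(), as the SourceSplit interface requires. Whether HBase replication can actually deliver WAL edits per region is still the open question from the issue; the second method simply assumes the region names are already known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: these names do not come from the PR or from Flink.
final class HBaseSplitSketch {

    /** What a split might carry: HBase instance, table, and optionally a region. */
    static final class Split {
        final String zookeeperQuorum; // assumed way to locate the HBase instance
        final String table;           // one source maps to one table
        final String region;          // null means "all regions" (option 1)

        Split(String zookeeperQuorum, String table, String region) {
            this.zookeeperQuorum = zookeeperQuorum;
            this.table = table;
            this.region = region;
        }

        /** Unique id per split, mirroring the splitId() of Flink's SourceSplit. */
        String splitId() {
            return region == null ? table : table + "/" + region;
        }
    }

    /** Option 1: a single split; the reader would filter the WAL edits itself. */
    static List<Split> enumerateSingle(String quorum, String table) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(quorum, table, null));
        return splits;
    }

    /** Option 2: one split per region; the region names are assumed inputs here. */
    static List<Split> enumeratePerRegion(String quorum, String table, List<String> regions) {
        List<Split> splits = new ArrayList<>();
        for (String region : regions) {
            splits.add(new Split(quorum, table, region));
        }
        return splits;
    }
}
```

With option 1 the enumerator needs no HBase lookup at all, while option 2 would have to ask HBase for the table's regions before it can hand out splits, which is where the parallelism would come from.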