[ 
https://issues.apache.org/jira/browse/FLINK-20955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777853#comment-17777853
 ] 

Ryan Skraba commented on FLINK-20955:
-------------------------------------

Hey, I see there's a PR on apache/flink with a couple of rounds of reviews!  Is 
anyone working on, or interested in, seeing the PR rebased and migrated to the 
external apache/flink-connector-hbase repo?  I previously did this for the 
Pub/Sub connector FLIP-27 update, and I'd be willing to move this one over if a 
committer can "commit" to taking a look when it's finished! ;)  
[~ferenc-csaky], I see you've recently committed to HBase; what do you think?

For the migration, I'd recommend splitting the existing PR so that it only 
includes the FLIP-27 Source. We can then discuss separately whether the sink 
part should jump directly to SinkV2 (FLIP-191).

> Refactor HBase Source in accordance with FLIP-27
> ------------------------------------------------
>
>                 Key: FLINK-20955
>                 URL: https://issues.apache.org/jira/browse/FLINK-20955
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / HBase, Table SQL / Ecosystem
>            Reporter: Moritz Manner
>            Priority: Not a Priority
>              Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> auto-unassigned, pull-request-available
>
> The HBase connector source implementation should be updated in accordance 
> with [FLIP-27: Refactor Source 
> Interface|https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface].
> One source should map to one table in HBase. Users can specify which 
> columns or column families to watch; each change in a watched column triggers 
> a record containing the change type, table, column family, column, value, and 
> timestamp.
> h3. Idea
> The new Flink HBase Source makes use of the internal [replication mechanism 
> of HBase|https://hbase.apache.org/book.html#_cluster_replication]. The Source 
> registers with the HBase cluster and receives all WAL edits written in 
> HBase. From those WAL edits, the Source can create the DataStream. 
> h3. Split
> We're still not 100% sure which information a Split should contain. We have 
> the following possibilities: 
>  # There is only one Split per Source, and the Split contains all the 
> information necessary to connect to HBase. The SourceReader that processes 
> the Split will receive all WAL edits for all tables and keep only the 
> relevant edits. 
>  # There are multiple Splits per Source, each Split representing one HBase 
> Region to read from. This assumes that it is possible to receive only the WAL 
> edits of a specific HBase Region rather than all WAL edits. This would 
> be preferable as it allows parallel processing of multiple regions, but we 
> still need to figure out how to make this possible.
> In both cases the Split will contain information about the HBase instance and 
> table. 
> h3. Split Enumerator
> Depending on which Split design we choose, the split enumerator will either 
> connect to HBase and retrieve all relevant regions, or just create a single 
> Split.
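
To make the two Split options from the description easier to compare, here is 
a rough plain-Java sketch. Everything in it is an assumption made for 
illustration: the class HBaseSplitSketch, the HBaseSplit fields (a ZooKeeper 
quorum string, a table name, an optional region id), and both enumerate 
methods are hypothetical and are not taken from the PR. In an actual FLIP-27 
connector the split would implement Flink's SourceSplit interface and the 
region list would be fetched from HBase; here the region ids are simply passed 
in.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: plain-Java stand-ins, not Flink or HBase API types. */
public class HBaseSplitSketch {

    /** Hypothetical split: HBase instance and table, plus an optional region. */
    public static final class HBaseSplit {
        final String zookeeperQuorum; // assumed way of locating the HBase instance
        final String table;
        final String regionId;        // null means "whole table" (option 1)

        HBaseSplit(String zookeeperQuorum, String table, String regionId) {
            this.zookeeperQuorum = zookeeperQuorum;
            this.table = table;
            this.regionId = regionId;
        }

        /** Unique id per split, mirroring what a FLIP-27 split has to provide. */
        public String splitId() {
            return regionId == null ? table : table + "/" + regionId;
        }
    }

    /** Option 1: one split per source; the reader filters WAL edits itself. */
    public static List<HBaseSplit> enumerateSingle(String quorum, String table) {
        List<HBaseSplit> splits = new ArrayList<>();
        splits.add(new HBaseSplit(quorum, table, null));
        return splits;
    }

    /** Option 2: one split per region; region ids are supplied by the caller here. */
    public static List<HBaseSplit> enumeratePerRegion(
            String quorum, String table, List<String> regionIds) {
        List<HBaseSplit> splits = new ArrayList<>();
        for (String regionId : regionIds) {
            splits.add(new HBaseSplit(quorum, table, regionId));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> regions = new ArrayList<>();
        regions.add("region-a");
        regions.add("region-b");
        // Print the split ids produced by each option for a made-up table.
        for (HBaseSplit s : enumerateSingle("zk1:2181", "orders")) {
            System.out.println("option 1: " + s.splitId());
        }
        for (HBaseSplit s : enumeratePerRegion("zk1:2181", "orders", regions)) {
            System.out.println("option 2: " + s.splitId());
        }
    }
}
```

In both variants the split carries the HBase instance and the table, as the 
description requires; only option 2 lets several readers work in parallel, 
and only if per-region WAL delivery turns out to be feasible.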



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
