[
https://issues.apache.org/jira/browse/NIFI-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945138#comment-14945138
]
Bryan Bende commented on NIFI-817:
----------------------------------
All, going to pick up where Mark left off and try to make progress on this ticket
for 0.4.0...
On the extraction side of things, here is what I gathered from reading the
above discussion...
* The GetHBase processor in the patch saves state on a single node, but needs
to also save state across the cluster, most likely similar to what we do in
ListHDFS with the distributed cache
* Would like a property/properties to specify columns and column families to
return, and possibly filters as well
* Consider using Avro as an output mechanism to provide a schema for the results
* Consider using a replication endpoint to stream WALs
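As a rough sketch of the column/column-family property from the list above,
one option is a single comma-separated value where each entry is either
"family" or "family:qualifier". The property syntax and the ColumnSpecParser
and Column names here are hypothetical and not taken from the patch; the
parsed entries could then be applied to an HBase Scan via addFamily/addColumn.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses a property value such as "cf1:colA, cf2"
// into (family, qualifier) pairs. The syntax is an assumption, not the patch's.
final class ColumnSpecParser {

    /** One requested column; a null qualifier means "the whole family". */
    static final class Column {
        final String family;
        final String qualifier;

        Column(String family, String qualifier) {
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** Returns an empty list for a null/blank spec, meaning "no restriction". */
    static List<Column> parse(String spec) {
        List<Column> columns = new ArrayList<>();
        if (spec == null || spec.trim().isEmpty()) {
            return columns;
        }
        for (String entry : spec.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int idx = trimmed.indexOf(':');
            if (idx < 0) {
                columns.add(new Column(trimmed, null));
            } else {
                String family = trimmed.substring(0, idx);
                String qualifier = trimmed.substring(idx + 1);
                if (family.isEmpty() || qualifier.isEmpty()) {
                    throw new IllegalArgumentException("Invalid column entry: " + trimmed);
                }
                columns.add(new Column(family, qualifier));
            }
        }
        return columns;
    }
}
```

Treating a blank value as "return everything" keeps the property optional;
filters would likely need a separate property.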
I looked at the replication endpoint a little bit and it does seem like an
interesting concept. My understanding is that you deploy a jar containing the
implementation of your endpoint to the lib directory of every region server.
The endpoint is then responsible for sending to the other system, and there is
also some code that has to be run to register/turn on your endpoint. The best
example I found was this:
https://github.com/risdenk/hbase-custom-replication-endpoint-example
We would have to figure out how this replication endpoint would send data to
NiFi. The first thing that comes to mind is the SiteToSiteClient, but I
haven't really thought this through. I'm wondering if we should proceed for now
on the GetHBase processor (with some of the improvements above) and track this
replication idea as another ticket, since it would likely have a much different
feel than a regular processor. Thoughts?
The put side of things seems to be more straightforward. I refactored the
processor in the current patch to pull in a configurable batch of FlowFiles on
each call to onTrigger, group them by table, and make one call to
table.put(List<Put>) per table. In the best case, all FlowFiles are for the same
table and it is a single call; in the worst case, they are all for different
tables and it is no different than processing each FlowFile one at a time.
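The grouping step can be sketched roughly as follows. This is a simplified,
dependency-free illustration and not the code in the patch: PutBatcher,
PendingPut, and TableWriter are hypothetical stand-ins for the processor, a
FlowFile's Put, and the HBase table client.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of batching writes per table; not the patch's classes.
final class PutBatcher {

    /** Stand-in for the Put derived from one FlowFile. */
    static final class PendingPut {
        final String table;
        final String rowId;

        PendingPut(String table, String rowId) {
            this.table = table;
            this.rowId = rowId;
        }
    }

    /** Stand-in for the client call that writes a list of puts to one table. */
    interface TableWriter {
        void put(String table, List<PendingPut> puts);
    }

    /** Groups a batch by table name, preserving first-seen table order. */
    static Map<String, List<PendingPut>> groupByTable(List<PendingPut> batch) {
        Map<String, List<PendingPut>> byTable = new LinkedHashMap<>();
        for (PendingPut p : batch) {
            byTable.computeIfAbsent(p.table, k -> new ArrayList<>()).add(p);
        }
        return byTable;
    }

    /** Issues one write per distinct table and returns the number of calls made. */
    static int flush(List<PendingPut> batch, TableWriter writer) {
        Map<String, List<PendingPut>> byTable = groupByTable(batch);
        for (Map.Entry<String, List<PendingPut>> e : byTable.entrySet()) {
            writer.put(e.getKey(), e.getValue());
        }
        return byTable.size();
    }
}
```

With a batch of N FlowFiles, the number of calls is the number of distinct
tables in the batch, so it ranges from 1 (all the same table) to N (all
different tables).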
> Create Processors to interact with HBase
> ----------------------------------------
>
> Key: NIFI-817
> URL: https://issues.apache.org/jira/browse/NIFI-817
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Reporter: Mark Payne
> Assignee: Bryan Bende
> Fix For: 0.4.0
>
> Attachments:
> 0001-NIFI-817-Initial-implementation-of-HBase-processors.patch
>
>