[ 
https://issues.apache.org/jira/browse/NIFI-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945138#comment-14945138
 ] 

Bryan Bende commented on NIFI-817:
----------------------------------

All, I'm going to pick up where Mark left off and try to make progress on this 
ticket for 0.4.0...

On the extraction side of things, here is what I gathered from reading the 
above discussion...
* The GetHBase processor in the patch saves state on a single node, but needs 
to also save state across the cluster, most likely similar to what we do in 
ListHDFS with the distributed cache
* Would like a property/properties to specify columns and column families to 
return, and possibly filters as well
* Consider using Avro as an output mechanism to provide a schema for the results
* Consider using a replication end-point to stream WALs
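
To make the column/column family property idea a bit more concrete, here is a 
rough sketch of how such a property value might be parsed. The format 
(comma-separated "family" or "family:qualifier" entries) and the names 
ColumnSpecSketch, Column, and parse are assumptions for illustration only; 
nothing like this is in the patch yet.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSpecSketch {

    /** Hypothetical holder: a null qualifier means "return the whole family". */
    public static final class Column {
        public final String family;
        public final String qualifier;

        public Column(String family, String qualifier) {
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** Parses an assumed property format such as "cf1, cf2:colA, cf2:colB". */
    public static List<Column> parse(String spec) {
        List<Column> columns = new ArrayList<>();
        if (spec == null || spec.trim().isEmpty()) {
            return columns; // nothing specified: caller would return all columns
        }
        for (String part : spec.split(",")) {
            String entry = part.trim();
            if (entry.isEmpty()) {
                continue; // tolerate stray commas
            }
            int idx = entry.indexOf(':');
            if (idx < 0) {
                columns.add(new Column(entry, null));
            } else {
                columns.add(new Column(entry.substring(0, idx), entry.substring(idx + 1)));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        for (Column c : parse("cf1, cf2:colA")) {
            System.out.println(c.family + (c.qualifier == null ? "" : ":" + c.qualifier));
        }
    }
}
```

Each parsed entry could then be applied to the scan as either a whole family or 
a single column, but that mapping is left out here on purpose.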

I looked at the replication endpoint a little bit and it does seem like an 
interesting concept. My understanding is that you deploy a jar containing the 
implementation of your endpoint to the lib directory of every region server. 
The endpoint is then responsible for sending to the other system, and there is 
also some code that has to be run to register/turn on your endpoint. The best 
example I found was this:
https://github.com/risdenk/hbase-custom-replication-endpoint-example

We would have to figure out how this replication endpoint would send data to 
NiFi. The first thing that comes to mind is the SiteToSiteClient, but I haven't 
really thought this through. I'm wondering if we should proceed for now with 
the GetHBase processor (with some of the improvements above) and track this 
replication idea as another ticket, since it would likely have a much different 
feel than a regular processor. Thoughts?

The put side of things seems to be more straightforward... I refactored the 
processor in the current patch to pull in a configurable batch of FlowFiles on 
each call to onTrigger, group them by table, and make one call to 
table.put(List<Put>) per table. In the best case, all FlowFiles are for the 
same table and it is a single call; in the worst case they are all for 
different tables and it is no different than processing each FlowFile one at a 
time.
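
The grouping step can be sketched roughly as below. This is plain Java with no 
NiFi or HBase dependencies, so PendingPut and groupByTable are hypothetical 
stand-ins (not the classes in the patch) for a FlowFile with its target table 
and for the logic inside onTrigger. Each resulting list is what would be handed 
to a single table.put(List<Put>) call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PutBatchSketch {

    /** Hypothetical stand-in for a FlowFile plus the table and row it targets. */
    public static final class PendingPut {
        public final String tableName;
        public final String rowId;

        public PendingPut(String tableName, String rowId) {
            this.tableName = tableName;
            this.rowId = rowId;
        }
    }

    /** Groups one batch by table name, keeping the order tables were first seen. */
    public static Map<String, List<PendingPut>> groupByTable(List<PendingPut> batch) {
        Map<String, List<PendingPut>> byTable = new LinkedHashMap<>();
        for (PendingPut put : batch) {
            byTable.computeIfAbsent(put.tableName, k -> new ArrayList<>()).add(put);
        }
        return byTable;
    }

    public static void main(String[] args) {
        List<PendingPut> batch = Arrays.asList(
                new PendingPut("users", "row1"),
                new PendingPut("events", "row2"),
                new PendingPut("users", "row3"));

        // One entry per table; the real processor would issue one put call per entry.
        for (Map.Entry<String, List<PendingPut>> e : groupByTable(batch).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " put(s)");
        }
    }
}
```

The number of map entries equals the number of calls to HBase for the batch, 
which is where the best case (one table, one call) and worst case (one call per 
FlowFile) come from.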


> Create Processors to interact with HBase
> ----------------------------------------
>
>                 Key: NIFI-817
>                 URL: https://issues.apache.org/jira/browse/NIFI-817
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Bryan Bende
>             Fix For: 0.4.0
>
>         Attachments: 
> 0001-NIFI-817-Initial-implementation-of-HBase-processors.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
