[
https://issues.apache.org/jira/browse/NIFI-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504769#comment-16504769
]
Pierre Villard commented on NIFI-4026:
--------------------------------------
Thanks for your feedback [~markap14] - I really think this would be a nice
feature to address a new set of use cases.
I filed this JIRA a year ago and haven't found time to work on it so far, but
I could give it a try in the coming weeks.
Regarding your comments:
* completely agree with you about the scripting language, we certainly don't
want to take that route, and as you said it could be done in a processor before
the RPG.
* I'm not familiar with the S2S internals, but creating multiple transactions
sounds perfect.
* I share your views regarding scaling up/down situations; we should do our
best to ensure consistency when a cluster is restarted or when nodes are
temporarily disconnected.
In S2S, when a client asks for the nodes in a cluster, I think we should have
the whole picture: the connection status of the nodes, but also which node is
the primary node. Allowing users to send all the data "back" to the primary
node could be useful in some use cases. I'll create a subtask and start working
on this part.
Regarding the persistence of the cluster topology - I'm not sure how this
would work (I guess I need to go into the code...). Let's say I have a 10-node
cluster and I'm persisting the cluster topology. When restarting the cluster,
if only 8 nodes start (and we don't expect the two missing nodes to come
back), when would you consider the persisted topology obsolete and replace it
with the 8-node topology?
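The attribute-hash partitioning proposed in the issue description below can be sketched roughly as follows. This is a minimal, hypothetical Python illustration of the idea, not NiFi code: `destination_node` and `bin_by_destination` are invented names, and a stable MD5-based hash is assumed so the mapping does not depend on a process-specific hash seed. It also shows the "bins" idea, grouping flow files by destination so each S2S transaction targets a single node.

```python
import hashlib

def destination_node(attribute_value, num_nodes):
    """Map a flow-file attribute value to a node index in 1..N.

    Hypothetical sketch of the NIFI-4026 proposal. A stable digest (MD5
    here) is used so the same attribute value always selects the same node.
    """
    digest = hashlib.md5(attribute_value.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer and reduce modulo N.
    bucket = int.from_bytes(digest[:8], "big") % num_nodes
    return bucket + 1

def bin_by_destination(flowfiles, attribute, num_nodes):
    """Group flow files into per-node bins so that one S2S transaction
    can be created per destination node instead of forcing a batch of 1."""
    bins = {}
    for ff in flowfiles:
        node = destination_node(ff[attribute], num_nodes)
        bins.setdefault(node, []).append(ff)
    return bins
```

With this scheme, all flow files sharing the same attribute value (e.g. the same `host` attribute) always land in the same bin, and therefore on the same remote node, which is the property the use cases below rely on.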
> SiteToSite Partitioning
> -----------------------
>
> Key: NIFI-4026
> URL: https://issues.apache.org/jira/browse/NIFI-4026
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Pierre Villard
> Priority: Major
>
> To address some use cases and provide more flexibility in the Site-to-Site
> mechanism, it would be interesting to introduce a S2S Partitioning Key.
> The idea would be to add a parameter to the S2S configuration to compute the
> destination node based on an attribute of a flow file. The user would specify
> the attribute to read from the incoming flow files, and a hashing function
> would be applied to this attribute's value to get a number between 1 and N (N
> being the number of nodes in the remote cluster) that selects the destination
> node.
> It could even be possible to let the user code a custom hashing function in a
> scripting language.
> This approach would potentially force the batch size to 1, or it could be
> necessary to create bins to batch together flow files that are supposed to go
> to the same node.
> Obviously, this raises the question of how to handle cluster scale up/down.
> However, I believe this is an edge case and should not block this feature.
> Some of the use cases could be:
> - better load balancing of the flow files when using the List/Fetch pattern
> (example: ListHDFS/FetchHDFS and load balance based on the size of the remote
> file to fetch)
> - being able to keep on the same node the data related to the same element
> (based on business requirements, example: all the logs from a given host
> should be merged in the same file and not have one file per NiFi node)
> - give the possibility to send all the data back to the primary node (we
> could say that if the hash function returns 0, then the destination node is
> the primary node) in case this is required for specific operations. This
> would avoid running the full workflow on the primary node only, when some
> parts can be load balanced.
> I also think that this work would be a good foundation for the "node
> labeling" stuff that has been discussed on the mailing lists in the past.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)