[
https://issues.apache.org/jira/browse/NIFI-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681065#comment-16681065
]
Koji Kawamura commented on NIFI-4026:
-------------------------------------
[~pvillard] Just checking: do we have anything to add to the load-balancing
capability released with 1.8.0, or can this JIRA be closed now? I believe
the cluster topology change scenario is handled by node offloading and
consistent hashing.
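The consistent-hashing behavior referred to above can be illustrated with a minimal sketch (this is not NiFi's actual implementation; the node names, virtual-node count, and use of MD5 are assumptions for illustration):

```python
import hashlib
from bisect import bisect

# Minimal consistent-hash ring: when a node leaves the cluster, only the
# keys that were mapped to that node move; all other keys stay put.
class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several points ("virtual nodes") on the ring.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first ring point at or after the key's hash.
        points = [h for h, _ in self.ring]
        idx = bisect(points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring3 = HashRing(["node1", "node2", "node3"])
ring2 = HashRing(["node1", "node2"])  # node3 removed from the cluster

keys = [f"flowfile-{i}" for i in range(1000)]
# Only keys that previously landed on node3 (roughly a third) are remapped.
moved = sum(ring3.node_for(k) != ring2.node_for(k) for k in keys)
```

With a plain `hash(key) % N` scheme, removing a node would instead reshuffle almost every key, which is why consistent hashing copes better with topology changes.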
> SiteToSite Partitioning
> -----------------------
>
> Key: NIFI-4026
> URL: https://issues.apache.org/jira/browse/NIFI-4026
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Pierre Villard
> Priority: Major
>
> To address certain use cases and provide more flexibility in the
> Site-to-Site mechanism, it would be interesting to introduce an S2S
> Partitioning Key.
> The idea would be to add a parameter to the S2S configuration to compute the
> destination node based on an attribute of each flow file. The user would
> specify the attribute to read from incoming flow files, and a hashing
> function would be applied to that attribute's value to produce a number
> between 1 and N (N being the number of nodes in the remote cluster) that
> selects the destination node.
> It could even be possible to let the user code a custom hashing function in a
> scripting language.
> This approach would potentially force the "batching" size to 1, or it could
> be necessary to create bins so that flow files destined for the same node
> are batched together.
> Obviously, this raises the question of how to handle cluster scale up/down.
> However, I believe this is an edge case and should not block this feature.
> Some of the use cases could be:
> - better load balancing of the flow files when using the List/Fetch pattern
> (example: ListHDFS/FetchHDFS and load balance based on the size of the remote
> file to fetch)
> - being able to keep on the same node the data related to the same element
> (based on business requirements, example: all the logs from a given host
> should be merged in the same file and not have one file per NiFi node)
> - giving the possibility to send all the data back to the primary node (we
> could say that if the hash function returns 0, then the destination node is
> the primary node) in case this is required for specific operations. This
> would avoid having to run the full workflow on the primary node when parts
> of it could be load balanced.
> I also think that this work would be a good foundation for the "node
> labeling" stuff that has been discussed on the mailing lists in the past.
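The attribute-hash selection and the binning described in the proposal could be sketched roughly as follows (hypothetical: the attribute name, the modulo scheme, and reserving 0 for the primary node are illustrative assumptions, not a committed design):

```python
import hashlib
from collections import defaultdict

# Sketch of the proposed S2S partitioning: hash a configured flow-file
# attribute to pick a destination node in 1..N, and bin flow files so
# that each batch only ever targets a single node.
def destination(attr_value, num_nodes):
    h = int(hashlib.md5(attr_value.encode()).hexdigest(), 16)
    # A returned 0 could be reserved to mean "send to the primary node".
    return 1 + h % num_nodes

def bin_by_destination(flowfiles, attr, num_nodes):
    bins = defaultdict(list)
    for ff in flowfiles:
        bins[destination(ff[attr], num_nodes)].append(ff)
    return bins  # each bin can then be sent to its node as one batch

# Example: all logs from the same host land in the same bin, and thus
# on the same remote node.
flowfiles = [{"host": f"server-{i % 5}", "id": i} for i in range(20)]
bins = bin_by_destination(flowfiles, "host", num_nodes=3)
```

Binning this way preserves batching: flow files sharing the same attribute value always hash to the same bin, so they can be merged on a single node downstream.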
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)