[ 
https://issues.apache.org/jira/browse/NIFI-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Villard resolved NIFI-4026.
----------------------------------
       Resolution: Duplicate
    Fix Version/s: 1.8.0

> SiteToSite Partitioning
> -----------------------
>
>                 Key: NIFI-4026
>                 URL: https://issues.apache.org/jira/browse/NIFI-4026
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Pierre Villard
>            Priority: Major
>             Fix For: 1.8.0
>
>
> To address some use cases and make the Site-to-Site (S2S) mechanism more
> flexible, it would be useful to introduce an S2S partitioning key.
> The idea would be to add a parameter in the S2S configuration to compute the
> destination node based on an attribute of the flow file. The user would set
> the attribute to read from the incoming flow files, and a hashing function
> would be applied to this attribute value to get a number between 1 and N (N
> being the number of nodes in the remote cluster) to select the destination
> node.
> It could even be possible to let the user write a custom hashing function in
> a scripting language.
> This approach would potentially force the “batching” size to 1, or it could
> be necessary to create bins to batch together flow files that are supposed
> to go to the same node.
> Obviously, this raises the question of how to handle cluster scale up/down.
> However, I believe this is an edge case and should not block this feature.
> Some of the use cases could be:
> - better load balancing of the flow files when using the List/Fetch pattern 
> (example: ListHDFS/FetchHDFS and load balance based on the size of the remote 
> file to fetch)
> - keeping the data related to the same element on the same node (based on
> business requirements, example: all the logs from a given host should be
> merged into the same file instead of one file per NiFi node)
> - making it possible to send all the data back to the primary node (we
> could say that if the hash function returns 0, then the destination node is
> the primary node) in case this is required for specific operations. This
> would avoid having to run the full workflow on the primary node only, when
> some parts can be load balanced.
> I also think that this work would be a good foundation for the "node
> labeling" idea that has been discussed on the mailing lists in the past.
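
The partitioning idea described above can be illustrated with a small sketch. This is not NiFi code and does not reflect how the feature was eventually implemented. Flow files are modeled as plain dictionaries of attributes, and the names `destination_node` and `bin_by_destination` are hypothetical. SHA-256 is used only as an example of a hash that is stable across processes; the proposal does not name a specific function.

```python
import hashlib
from collections import defaultdict


def destination_node(attribute_value: str, node_count: int) -> int:
    """Map an attribute value to a node number between 1 and node_count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    digest = hashlib.sha256(attribute_value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo N.
    bucket = int.from_bytes(digest[:8], "big") % node_count
    return bucket + 1  # shift from 0..N-1 to 1..N, as in the description


def bin_by_destination(flow_files, partition_attribute, node_count):
    """Group flow files into one bin per destination node.

    Flow files that lack the attribute are hashed on an empty string
    (an assumption made for this sketch only).
    """
    bins = defaultdict(list)
    for flow_file in flow_files:
        key = flow_file.get(partition_attribute, "")
        bins[destination_node(key, node_count)].append(flow_file)
    return dict(bins)


if __name__ == "__main__":
    flow_files = [
        {"host": "web-1", "filename": "a.log"},
        {"host": "web-2", "filename": "b.log"},
        {"host": "web-1", "filename": "c.log"},
    ]
    for node, batch in sorted(bin_by_destination(flow_files, "host", 3).items()):
        print(node, [f["filename"] for f in batch])
```

With bins like these, each batch can be sent to a single node, so the batch size does not have to drop to 1. Because the node number depends on N, most keys move to a different node when the cluster is scaled up or down, which is the edge case mentioned in the description.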



