[ https://issues.apache.org/jira/browse/NIFI-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pierre Villard resolved NIFI-4026.
----------------------------------
       Resolution: Duplicate
    Fix Version/s: 1.8.0

> SiteToSite Partitioning
> -----------------------
>
>                 Key: NIFI-4026
>                 URL: https://issues.apache.org/jira/browse/NIFI-4026
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Pierre Villard
>            Priority: Major
>             Fix For: 1.8.0
>
>
> To answer some use cases and to provide more flexibility to the
> Site-to-Site mechanism, it would be interesting to introduce an S2S
> Partitioning Key.
> The idea would be to add a parameter in the S2S configuration to compute the
> destination node based on an attribute of a flow file. The user would set
> the attribute to read from the incoming flow files, and a hashing function
> would be applied to this attribute value to get a number between 1 and N (N
> being the number of nodes in the remote cluster) to select the destination
> node.
> It could even be possible to let the user code a custom hashing function in a
> scripting language.
> This approach would potentially force the “batching” to 1, or it could be
> necessary to create bins to batch together flow files that are supposed to go
> to the same node.
> Obviously, this raises the question of how to handle cluster scale
> up/down. However, I believe this is an edge case and should not block
> this feature.
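
The partitioning-key idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not NiFi code: flow files are modelled as plain attribute dictionaries, the function names `destination_node` and `bin_by_node` are hypothetical, and SHA-256 is an arbitrary choice of stable hash (any deterministic hash would do). The second function shows the "bins" variant, where flow files bound for the same node are grouped so they can still be sent as a batch.

```python
import hashlib
from collections import defaultdict


def destination_node(key_value: str, node_count: int) -> int:
    """Map an attribute value to a node number between 1 and node_count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # A cryptographic digest is used only because it is stable across
    # processes and machines (Python's built-in hash() of str is not).
    digest = hashlib.sha256(key_value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and fold into 1..N.
    return int.from_bytes(digest[:8], "big") % node_count + 1


def bin_by_node(flow_files, key_attribute: str, node_count: int) -> dict:
    """Group flow files (attribute dicts) by their computed destination node.

    Assumption: a flow file missing the attribute is hashed on the empty
    string, so all such flow files end up on the same node.
    """
    bins = defaultdict(list)
    for attributes in flow_files:
        node = destination_node(attributes.get(key_attribute, ""), node_count)
        bins[node].append(attributes)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical flow files keyed on a "hostname" attribute.
    flow_files = [
        {"hostname": "web-01", "filename": "a.log"},
        {"hostname": "web-02", "filename": "b.log"},
        {"hostname": "web-01", "filename": "c.log"},
    ]
    for node, batch in sorted(bin_by_node(flow_files, "hostname", 3).items()):
        print(node, [ff["filename"] for ff in batch])
```

With a plain modulo like this, changing N remaps most keys to different nodes, which is the scale up/down concern raised above.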
> Some of the use cases could be:
> - better load balancing of the flow files when using the List/Fetch pattern
> (example: ListHDFS/FetchHDFS and load balance based on the size of the remote
> file to fetch)
> - being able to keep on the same node the data related to the same element
> (based on business requirements, example: all the logs from a given host
> should be merged in the same file and not have one file per NiFi node)
> - the possibility to send all the data back to the primary node (we
> could say that if the hash function returns 0, then the destination node is
> the primary node) in case this is required for specific operations. This
> would avoid the need to run the full workflow on the primary node only when
> some parts can be load balanced.
> I also think that this work would be a good foundation for the "node
> labeling" stuff that has been discussed on the mailing lists in the past.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)