Hi everyone,

Simple flow in NiFi 1.2.0: ListHDFS -> FetchHDFS -> PutHDFS
I am just moving files from one HDFS folder to another for evaluation purposes, to see whether NiFi can be used for this sort of ETL. The benchmark uses a dataset of 50 x 1 GB input files. I am testing with varying cluster sizes (1, 2, and 3 nodes) and am expecting to see linear scalability.

I am following the advice in this article <https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html> and the docs <https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering>: ListHDFS runs on the primary node and sends the file names to a Remote Process Group, which I expect to load-balance them across the FetchHDFS -> PutHDFS steps running on each of the nodes.

However, 90% of the files get sent to only one node, even on a 3-node cluster. It seems the load balancing is not Round Robin, and the Remote Process Group does not allow you to set a policy either.

Later I found this post <https://community.hortonworks.com/questions/53153/load-balancing-while-the-fetching-of-file-from-a-s.html> which says:

"Nodes with higher load will get fewer FlowFiles. The load balancing is done in batches for efficiency, so under light load you may not see an exact balanced delivery, but under higher FlowFile volumes you will see a balanced delivery over the 5 minutes delivery statistics."

Questions:

- What are the ways to get more control and explicitly enforce a load-balancing policy like Round Robin across nodes? I found a *DistributeLoad* processor, which I haven't tried because, based on its docs, it seems to balance load between multiple outbound processors (which obviously will be on the same node).
- NiFi seems to be sensitive to skew in input file sizes because it treats each file as a single unit (it does not partition files), so a larger file is processed entirely by one node and effectively takes much longer. What are the recommended ways to mitigate this?

Thanks,
M
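
P.S. To make the impact of the skew concrete, here is a rough back-of-the-envelope sketch in Python. It is not NiFi code. It assumes every node processes its files one after another at the same rate, so the run ends when the busiest node finishes. The 45/3/2 split is only a hypothetical illustration of "90% on one node", not the per-node counts I actually measured, and the function names are made up for this example.

```python
# Simplified model, not a NiFi API: equal-sized files, identical nodes,
# each node works through its own files sequentially.

def speedup(files_per_node):
    """Speedup relative to a single node processing every file."""
    total = sum(files_per_node)
    # The run finishes when the node with the most files finishes.
    return total / max(files_per_node)

def round_robin(num_files, num_nodes):
    """Number of files each node receives under strict Round Robin."""
    counts = [0] * num_nodes
    for i in range(num_files):
        counts[i % num_nodes] += 1
    return counts

balanced = round_robin(50, 3)  # 50 x 1 GB files on 3 nodes
skewed = [45, 3, 2]            # hypothetical split with 90% on one node

print(balanced, round(speedup(balanced), 2))  # [17, 17, 16] 2.94
print(skewed, round(speedup(skewed), 2))      # [45, 3, 2] 1.11
```

Under these assumptions a strict Round Robin split would get close to the linear scaling I was hoping for, while the skewed split is barely faster than a single node.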