Hi everyone,

Simple flow in NiFi 1.2.0:
ListHDFS -> FetchHDFS -> PutHDFS

Just moving files from one HDFS folder to another for evaluation purposes,
to see if NiFi can be used for this sort of ETL.

The benchmark I am doing is on a dataset of 50 x 1 GB input files.

I am testing out with varying cluster sizes: 1, 2, 3 nodes and am expecting
to see linear scalability.

I am following the advice in this article
<https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html>
and the docs
<https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering>
by executing ListHDFS on the primary node and sending the file names to a
Remote Process Group, which I am expecting to load balance them across the
FetchHDFS -> PutHDFS flow running on each of the nodes. However, 90% of the
files get sent to only one node, even on a 3 node cluster. It seems the load
balancing is not Round Robin, and the Remote Process Group does not allow
you to set one either. Later I found this post
<https://community.hortonworks.com/questions/53153/load-balancing-while-the-fetching-of-file-from-a-s.html>
which says "Nodes with higher load will get fewer FlowFiles. The load balancing
is done in batches for efficiency, so under light load you may not see an
exact balanced delivery, but under higher FlowFile volumes you will see a
balanced delivery over the 5 minutes delivery statistics."
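
The batching effect described in that quote can be illustrated with a toy
model. This is only a sketch, not NiFi's actual site-to-site algorithm: the
function name, the node names, and the batch size of 100 are all assumptions
made for illustration, and the real distribution also takes node load into
account.

```python
# Toy model (NOT NiFi's real algorithm): whole batches are handed to nodes
# in turn. The batch size of 100 is a hypothetical value for illustration.
def distribute_in_batches(num_files, nodes, batch_size):
    counts = {node: 0 for node in nodes}
    sent = 0
    turn = 0
    while sent < num_files:
        batch = min(batch_size, num_files - sent)
        counts[nodes[turn % len(nodes)]] += batch
        sent += batch
        turn += 1
    return counts

nodes = ["node1", "node2", "node3"]

# Light load: 50 file names fit into a single batch, so one node gets them all.
light = distribute_in_batches(50, nodes, batch_size=100)

# Heavy load: many batches, so the per-node counts even out.
heavy = distribute_in_batches(50_000, nodes, batch_size=100)

print(light)
print(heavy)
```

Under this model, a listing of only 50 files is too small to spread evenly,
while a much larger number of FlowFiles ends up close to balanced.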

Questions:
- What are the ways to get more control and explicitly enforce a load
balancing policy like Round Robin across nodes?
I found a *DistributeLoad* processor, which I haven't tried because, based on
its docs, it seems to load balance between multiple outbound
processors (which obviously will be on the same node).
- NiFi seems to be sensitive to skews in input file sizes because it treats
each file as a single unit (it does not partition them), which means that a
larger file is processed entirely by one node and so takes much longer. What
are the recommended ways to mitigate this?
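
To make the skew concern concrete, here is a small calculation. It is a
sketch under assumptions: the file sizes are made up, and `per_node_load` is
a hypothetical helper, not a NiFi API. It assumes a perfect Round Robin by
file count and shows that the node receiving the large file still ends up
with most of the work.

```python
# Hypothetical helper: assign files Round Robin by count and return the
# total number of GB each node has to move. File sizes are invented.
def per_node_load(file_sizes_gb, num_nodes):
    loads = [0] * num_nodes
    for i, size in enumerate(file_sizes_gb):
        loads[i % num_nodes] += size
    return loads

# Six equal 1 GB files across three nodes: every node moves the same amount.
even = per_node_load([1, 1, 1, 1, 1, 1], 3)

# One 10 GB file among 1 GB files: the first node carries most of the data.
skewed = per_node_load([10, 1, 1, 1, 1, 1], 3)

print(even)
print(skewed)
```

Even with exact Round Robin, the total run time in the skewed case is bounded
by the node that got the large file.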

Thanks,
M
