Martin,

The problem you're hitting is that site-to-site does not, by default,
load balance file by file.  It sends a batch of files to one node,
then a batch to another, and so on.  This was tuned for constant,
high-rate, high-volume transmission, so a small test like this will
give skewed results.  Did you tune the site-to-site batch settings,
which became available with NIFI-1202
(https://issues.apache.org/jira/browse/NIFI-1202)?

I'd assume you can set the batch size to one (I've never done this),
and that should then behave the way you're looking for.
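
To illustrate why the batch size matters for a small test, here is a
toy simulation.  It is only a sketch and not NiFi's actual peer
selection logic: the function name, the simple node rotation, and the
batch size of 45 are all made up for illustration.

```python
def distribute(num_files, num_nodes, batch_size):
    """Toy model: hand out files in consecutive batches, rotating nodes.

    Hypothetical helper, NOT how NiFi site-to-site really picks peers.
    It only shows how batch size affects the spread of a small, fixed
    set of files across nodes.
    """
    counts = {node: 0 for node in range(num_nodes)}
    node = 0
    remaining = num_files
    while remaining > 0:
        sent = min(batch_size, remaining)   # last batch may be smaller
        counts[node] += sent
        remaining -= sent
        node = (node + 1) % num_nodes       # move on to the next node
    return counts


if __name__ == "__main__":
    # 50 files, 3 nodes, large (assumed) batch: most files hit one node.
    print(distribute(50, 3, 45))
    # Same files with a batch size of one: close to an even spread.
    print(distribute(50, 3, 1))
```

With only 50 files, one large batch can cover most of the data set,
which is consistent with one node receiving about 90% of the files.
With a batch size of one, the same toy model spreads them nearly
evenly.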

Thanks

On Fri, Jun 2, 2017 at 3:32 PM, Martin Eden <martineden...@gmail.com> wrote:
> Hi everyone,
>
> Simple flow in NiFi 1.2.0:
> ListHDFS -> FetchHDFS -> PutHDFS
>
> Just moving files from one HDFS folder to another for evaluation purposes,
> to see if NiFi can be used for this sort of ETL.
>
> The benchmark I am doing is on a dataset of 50 x 1 GB input files.
>
> I am testing out with varying cluster sizes: 1, 2, 3 nodes and am expecting
> to see linear scalability.
>
> I am following the advice in this article
> <https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html>
> and the docs
> <https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering>
> by executing the ListHDFS on the primary and sending the file names to a
> Remote Process Group which I am expecting to do the load balancing to the
> FetchHDFS -> PutHDFS executed on each of the nodes. However 90% of the
> files get sent to only one node even on a 3 node cluster. It seems the load
> balancing is not Round Robin and the Remote Process Group does not allow
> you to set one either. Later I found this post
> <https://community.hortonworks.com/questions/53153/load-balancing-while-the-fetching-of-file-from-a-s.html>
> which says "Nodes with higher load will get fewer FlowFiles. The load balancing
> is done in batches for efficiency, so under light load you may not see an
> exact balanced delivery, but under higher FlowFile volumes you will see a
> balanced delivery over the 5 minutes delivery statistics."
>
> Questions:
> - What are the ways to get more control and explicitly enforce a load
> balancing policy like Round Robin across nodes?
> I found a *DistributeLoad* processor which I haven't tried because, based on
> its docs, it seems to load balance between multiple outbound
> processors (which obviously will be on the same node).
> - NiFi seems to be sensitive to skews in input file sizes because it treats
> files as one (does not partition them) which means that larger files get
> processed by one node and will effectively be processed much slower. What
> are the recommended ways to mitigate this?
>
> Thanks,
> M
