Hi Joe,

I changed the batch size on the input port of the Remote Process Group to 1
and got the results I was looking for: ~1/3 of the time for a 3 node
cluster compared to 1 node. So big thanks!
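
To illustrate why the batch size made such a difference, here is a minimal Python sketch of a simplified model. It assumes files are handed out in contiguous batches and that the target node rotates after every batch. This is only an assumption for illustration, not NiFi's actual site-to-site algorithm (which also weighs node load), and the function name `distribute` is hypothetical.

```python
def distribute(num_files, num_nodes, batch_size):
    """Count files per node when contiguous batches rotate across nodes.

    Simplified model (assumption): batch k goes to node k % num_nodes.
    """
    counts = [0] * num_nodes
    for start in range(0, num_files, batch_size):
        node = (start // batch_size) % num_nodes
        # The last batch may be smaller than batch_size.
        counts[node] += min(batch_size, num_files - start)
    return counts


if __name__ == "__main__":
    for batch_size in (100, 1):
        print(batch_size, distribute(50, 3, batch_size))
```

Under this model, 50 files on 3 nodes with a batch size of 100 all land on one node (50/0/0), while a batch size of 1 spreads them 17/17/16.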

Any thoughts on my second question, though?
- NiFi seems to be sensitive to skews in input file sizes because it treats
files as one (does not partition them) which means that larger files get
processed by one node and will effectively be processed much slower. What
are the recommended ways to mitigate this?
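
A rough Python sketch of the skew effect I mean, under stated assumptions: processing time is proportional to file size, whole files are assigned to nodes round robin, and nodes work in parallel. The function names and the example sizes are hypothetical, not measurements from my cluster.

```python
def makespan_whole_files(sizes_gb, num_nodes):
    """Largest per-node load when whole files are assigned round robin."""
    loads = [0.0] * num_nodes
    for i, size in enumerate(sizes_gb):
        loads[i % num_nodes] += size
    return max(loads)


def ideal_makespan(sizes_gb, num_nodes):
    """Lower bound if the data could be split perfectly evenly across nodes."""
    return sum(sizes_gb) / num_nodes


if __name__ == "__main__":
    uniform = [1] * 50            # 50 x 1 GB, as in my benchmark
    skewed = [30] + [1] * 20      # same 50 GB total, one large file
    for name, sizes in (("uniform", uniform), ("skewed", skewed)):
        print(name, makespan_whole_files(sizes, 3), ideal_makespan(sizes, 3))
```

In this model the uniform dataset finishes close to the ideal (17 GB on the busiest node versus about 16.7 GB), while the skewed dataset leaves one node with 36 GB of work because the 30 GB file cannot be shared.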

Thanks again,
Martin


On Fri, Jun 2, 2017 at 8:37 PM, Joe Witt <joe.w...@gmail.com> wrote:

> Martin,
>
> The problem you're hitting is that site-to-site doesn't by default do
> file by file load balancing.  It sends a set of files to one node,
> then a set to another, and so on.  This was tuned for constant high
> rate/volume transmission so a test like this will have funny results.
> Did you tune the batch settings in site-to-site, which became available
> with NIFI-1202 (https://issues.apache.org/jira/browse/NIFI-1202)?
>
> You can set it to a batch size of one, I'd assume (I've never done this),
> and that should then behave the way you're looking for.
>
> Thanks
>
> On Fri, Jun 2, 2017 at 3:32 PM, Martin Eden <martineden...@gmail.com>
> wrote:
> > Hi everyone,
> >
> > Simple flow in NiFi 1.2.0:
> > ListHDFS -> FetchHDFS -> PutHDFS
> >
> > Just moving files from one HDFS folder to another for evaluation purposes,
> > to see if NiFi can be used for this sort of ETL.
> >
> > The benchmark I am running uses a dataset of 50 x 1 GB input files.
> >
> > I am testing with varying cluster sizes (1, 2, and 3 nodes) and am expecting
> > to see linear scalability.
> >
> > I am following the advice in this article
> > <https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html>
> > and the docs
> > <https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering>
> > by executing the ListHDFS on the primary and sending the file names to a
> > Remote Process Group which I am expecting to do the load balancing to the
> > FetchHDFS -> PutHDFS executed on each of the nodes. However 90% of the
> > files get sent to only one node even on a 3 node cluster. It seems the load
> > balancing is not Round Robin and the Remote Process Group does not allow
> > you to set one either. Later I found this post
> > <https://community.hortonworks.com/questions/53153/load-balancing-while-the-fetching-of-file-from-a-s.html>
> > which says "Nodes with higher load will get fewer FlowFiles. The load balancing
> > is done in batches for efficiency, so under light load you may not see an
> > exact balanced delivery, but under higher FlowFile volumes you will see a
> > balanced delivery over the 5 minutes delivery statistics."
> >
> > Questions:
> > - What are the ways to get more control and explicitly enforce a load
> > balancing policy like Round Robin across nodes?
> > I found a *DistributeLoad* processor which I haven't tried because, based on
> > its docs, it seems to load balance between multiple outbound
> > processors (which obviously will be on the same node).
> > - NiFi seems to be sensitive to skews in input file sizes because it treats
> > files as one (does not partition them) which means that larger files get
> > processed by one node and will effectively be processed much slower. What
> > are the recommended ways to mitigate this?
> >
> > Thanks,
> > M
>
