Joe,

Thanks, your tuning comments all make sense.
If they didn't have similar CPU and RAM scales I probably would not have tried it. It's only been running a couple of days, but I've already noticed some anecdotal performance differences. For instance, the Linux and OSX nodes appear to process more flow files than the Windows node, though I don't know whether that's due to the SSDs or the different file systems. The cluster runs better than I expected for non-server hardware. I haven't hammered it hard yet, but eventually I'll pull together some NiFi performance stats and system/OS benchmark control numbers.

I had some bad hot spots in the flow, specifically before the ControlRate and UpdateAttribute processors, so I tried splitting the flow with a DistributeLoad to 3 instances of each, and did the same for the highest-volume PutFile too. That made a big difference and the hot spots were gone. Now there are several warm spots, but the queue sizes are much more even across the graph, and a big influx of files moves more steadily through the graph instead of racing from one backup to the next. Does that make sense?

Joe

On Tue, Sep 27, 2016 at 8:31 AM, Joe Witt <joe.w...@gmail.com> wrote:
> JoeS
>
> I think you are seeing a queue bug that has been corrected or reported on the 1.x line.
>
> As for the frankencluster concept i think it is generally fair game. There are a number of design reasons, most notably back pressure, that make this approach feasible. So the big ticket items to consider are things like
>
> CPU
> Since the model of NiFi is that basically all processes/tasks are eligible to run on all nodes, and the number of threads and tasks configured per controller and component is applied to all nodes, this could be problematic when there is a substantive imbalance of power on the various systems. If this were important to improve we could allow node-local overrides of max controller threads. That helps a bit but doesn't really solve it. Again, back pressure is probably the most effective. There are probably a number of things we could do here if needed.
>
> Disk
> We have to consider the speed, congestion, and storage available on the disk(s) and how they're partitioned and such for our various repositories. Again, back pressure is one of the more effective mechanisms here because it is all about doing as much as you can, which means other nodes should be able to take on more or less. Fortunately the configuration of the repositories and such is node-local, so we can have pretty considerable variety here and things work pretty well.
>
> Network
> Back pressure for the win. Though significant imbalances could lead to significant congestion, which could cause inefficiencies in general, so we would need to be careful. That scenario would most likely require wildly imbalanced node capabilities and very high rate flows.
>
> Memory
> JVM heap size variability and/or off-heap memory differences could cause some nodes to behave wildly differently than others in ways that back pressure will not necessarily solve. For instance, a node with too low a heap size for the types of processes in the flow could yield order(s) of magnitude lower performance than another node. We should do more for these things. Users should not have to configure things like swapping thresholds, for instance. We should determine and tune those values at runtime. It is simply too hard to find a good magic number that predicts the likely number and size of the flow file attributes that might be needed, and those can have a substantial impact on heap usage.
>
> Right now we treat swapping on a per-queue basis though it is configured globally. If you have say just 100 queues, each holding 1000 flowfiles in memory, you have all the attributes of those 100,000 flowfiles in memory. If each flow file took up just 1KB of memory we're talking 100+MB. Perhaps a slightly odd example, but users aren't going to go through and think about every queue and the optimal global swapping setting, though it is an important number. The system should be watching them all and doing this automatically. That could help quite a lot. We may also end up needing to not even hold flowfile attributes in memory, though supporting this would require API changes to ensure they're only accessed in stream-friendly ways. Doing this for all uses of EL is probably pretty straightforward, but all the direct attribute map accesses would need consideration.
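A quick back-of-envelope on that memory point, using just the example figures above (100 queues x 1,000 flowfiles x ~1KB of attributes). This isn't anything NiFi computes; it's only the arithmetic spelled out in plain Java:

    // Back-of-envelope only: worst-case heap held by flowfile attributes when
    // every queue keeps its full in-memory allotment. Not NiFi code.
    public class QueueHeapEstimate {
        public static void main(String[] args) {
            int queues = 100;                // queues in the flow (example figure)
            int flowFilesPerQueue = 1_000;   // flowfiles each queue holds in heap (example figure)
            long bytesPerFlowFile = 1_024;   // ~1KB of attributes per flowfile (example figure)

            long flowFilesInHeap = (long) queues * flowFilesPerQueue;  // 100,000
            long heapBytes = flowFilesInHeap * bytesPerFlowFile;       // ~102 MB

            System.out.printf("%,d flowfiles pinned in heap ~= %,d MB%n",
                    flowFilesInHeap, heapBytes / 1_000_000);
        }
    }

If I'm reading the 1.0.0 docs right, the global setting in question is nifi.queue.swap.threshold in nifi.properties, which defaults to 20,000 flowfiles per queue, so the real worst case is well above the 1,000 per queue used in the example.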
> ...And we also need to think through things like
>
> OS Differences in accessing resources
> We generally follow "Pure Java (tm)" practices where possible, so this helps a lot. But still, things like accessing specific file paths as might be needed in the flow configurations themselves (GetFile/PutFile for example) could be tricky (but doable).
>
> The protocols used to source data matter a lot
> With all this talk of back pressure, keep in mind that how data gets into NiFi becomes really critical in these clusters. If you use protocols which do not afford fault tolerance and load balancing then things are not great. Protocols which have queuing semantics or feedback mechanisms, or which let NiFi as the consumer control things, will work out well. Some portions of JMS are good for this. Kafka is good for this. NiFi's own site-to-site is good for this.
>
> The frankencluster testing is a valuable way to force and think through interesting issues. Maybe the frankencluster as you have it isn't realistic, but it still exposes the concepts that need to be thought through for cases that definitely are.
>
> Thanks
> Joe
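On the Kafka point above, the property that makes it friendly to an uneven cluster is that the consumer asks for data rather than having it pushed, so each node works at its own pace. A minimal sketch of that pull model with the plain Kafka client follows; it is not what NiFi's Kafka processors do internally, and the broker address, topic, and group id are made up for illustration:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Each node runs the same loop with the same group.id, so Kafka spreads the
    // topic's partitions across the nodes, and every node pulls records at
    // whatever pace it can sustain instead of having data pushed at it.
    public class PullPacedConsumer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // made-up broker address
            props.put("group.id", "franken-cluster-ingest");  // shared group id across nodes
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("ingest-topic")); // made-up topic
                while (true) {
                    // The consumer asks for data; a slow node just asks less often.
                    ConsumerRecords<String, String> records = consumer.poll(500);
                    for (ConsumerRecord<String, String> record : records) {
                        handle(record.value());
                    }
                }
            }
        }

        // Stand-in for whatever the flow would actually do with each record.
        private static void handle(String value) {
            System.out.println("processing " + value.length() + " chars");
        }
    }

Which is presumably why site-to-site and the queue-backed parts of JMS get the same nod above: the slower node simply falls behind on its own share instead of being flooded.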
> On Tue, Sep 27, 2016 at 7:37 AM, Joe Skora <jsk...@gmail.com> wrote:
> > The images just show what the text described: 13 files queued, EmptyQueue returns 0 of 13 removed, and ListQueue returns that the queue has no flowfiles.
> >
> > There were 13 files of 1k sitting in a queue between a SegmentContent and ControlRate. After I sent that email I had to stop/start the processors a couple of times for other things, and somewhere in the midst of that the queue cleared.
> >
> > On Mon, Sep 26, 2016 at 11:05 PM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> >> Joe,
> >>
> >> I didn't get the images (might just be my exchange server). How many files are in the queue? (exact count please)
> >>
> >> --Peter
> >>
> >> From: Joe Skora [mailto:jsk...@gmail.com]
> >> Sent: Monday, September 26, 2016 8:20 PM
> >> To: dev@nifi.apache.org
> >> Subject: Questions about heterogeneous cluster and queue problem/bug/oddity in 1.0.0
> >>
> >> I have a 3 node test franken-cluster that I'm abusing for the sake of learning. The systems run Ubuntu 15.04, OS X 10.11.6, and Windows 10, and though far from identical, each has a quad-core i7 between 2.5 and 3.5 GHz and 16GB of RAM. Two have SSDs and the third has a 7200RPM SATA III drive.
> >>
> >> 1) Is there any reason mixing operating systems within the cluster would be a bad idea? Once configured it seems to run ok.
> >> 2) Will performance disparities affect reliability or performance within the cluster?
> >> 3) Are there ways to configure disparate systems such that they can all perform at peak?
> >>
> >> The bug or issue I have run into is a queue showing files that can't be removed or listed. Screen shots attached below. I don't know if it's a mixed-OS issue, something I did while torturing the systems (all stayed up, this time), or just a weird anomaly.
> >>
> >> Regards,
> >> Joe
> >>
> >> Trying to empty the queue seen in the background
> >> [Inline image 1]
> >>
> >> but the flowfiles cannot be deleted.
> >> [Inline image 2]
> >>
> >> But try to list them and it says there are no files in the queue?
> >> [Inline image 3]