JoeS,

I think you are seeing a queue bug that has already been reported, and
possibly corrected, on the 1.x line.

As for the frankencluster concept I think it is generally fair game.
There are a number of design features, most notably back pressure,
that make this approach feasible.  So the big ticket items to consider
are things like

CPU
Since NiFi's model is that essentially all processors/tasks are
eligible to run on every node, and the thread and concurrent task
counts configured for the controller and for each component apply to
all nodes, this could be problematic when there is a substantive
imbalance of power across the various systems.  If this were important
to improve we could allow node-local overrides of the max controller
threads.  That helps a bit but doesn't really solve it.  Again, back
pressure is probably the most effective mechanism.  There are a number
of other things we could do here if needed.
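
To make the override idea concrete, a node-local setting might look
something like this in nifi.properties (to be clear, this property
does not exist today, it is purely a sketch of the idea):

  # hypothetical property, not in NiFi today: cap this node's timer
  # driven threads regardless of the cluster-wide maximum set in the UI
  nifi.scheduling.timer.driven.threads.max.override=4

A weaker node could then run with fewer threads while the stronger
nodes honor the cluster-wide maximum.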

Disk
We have to consider the speed, congestion, and available storage of
the disk(s), and how they are partitioned, for our various
repositories.  Again, back pressure is one of the more effective
mechanisms here because each node simply does as much as it can, which
means other nodes naturally take on more or less.  Fortunately the
configuration of the repositories is node-local, so we can have pretty
considerable variety here and things still work pretty well.
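
For example, each node's nifi.properties can point its repositories at
whatever storage that particular box actually has (paths below are
just placeholders):

  # node-local repository locations in nifi.properties (example paths)
  nifi.flowfile.repository.directory=./flowfile_repository
  nifi.content.repository.directory.default=/ssd1/content_repository
  nifi.provenance.repository.directory.default=/sata1/provenance_repository

A node with a single slow disk and a node with several SSDs can both
participate; back pressure evens out how much work each takes on.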

Network
Back pressure for the win.  That said, significant imbalances could
lead to significant congestion, which could cause general
inefficiencies, so we would need to be careful.  That scenario would
most likely require wildly imbalanced node capabilities and very high
rate flows.
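
As a reminder, back pressure is configured per connection.  If memory
serves, new connections in 1.0 default to something like:

  Back Pressure Object Threshold:     10,000 flowfiles
  Back Pressure Data Size Threshold:  1 GB

Tightening those on the connections that feed the heaviest parts of
the flow is probably the main lever available today.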

Memory
Differences in JVM heap size and/or off-heap memory could cause some
nodes to behave wildly differently than others in ways that back
pressure will not necessarily solve.  For instance, a node whose heap
is too small for the types of processors in the flow could yield
order(s) of magnitude lower performance than another node.  We should
do more for these things.  Users should not have to configure things
like swapping thresholds, for instance; we should determine and tune
those values at runtime.  It is simply too hard to find a good magic
number that predicts the likely number and size of flowfile
attributes, and those can have a substantial impact on heap usage.

Right now we apply swapping on a per-queue basis even though the
threshold is configured globally.  If you have, say, just 100 queues
each holding 1,000 flowfiles in memory, you have the attributes of all
100,000 flowfiles in memory.  If each flowfile takes up just 1KB of
memory, that is 100+MB.  Perhaps a slightly odd example, but users
aren't going to go through every queue and reason about the optimal
global swapping setting, even though it is an important number.  The
system should be watching them all and doing this automatically.  That
could help quite a lot.

We may also end up needing to not hold flowfile attributes in memory
at all, though supporting that would require API changes to ensure
they're only accessed in stream-friendly ways.  Doing this for all
uses of EL is probably pretty straightforward, but all the direct
attribute map accesses would need consideration.
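
Concretely, the knobs in play today are the JVM heap settings in
bootstrap.conf and the global swap threshold in nifi.properties
(values shown are the shipped defaults, as I recall):

  # bootstrap.conf - JVM heap; nodes can legitimately differ here
  java.arg.2=-Xms512m
  java.arg.3=-Xmx512m

  # nifi.properties - swap threshold, applied per queue
  nifi.queue.swap.threshold=20000

Note that with the default of 20000, the 100-queues-of-1,000 example
above never swaps at all, which is exactly the sort of thing the
system ought to be noticing and tuning on its own.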

...And we also need to think through things like

OS Differences in accessing resources
We generally follow "Pure Java (tm)" practices where possible.  So
this helps a lot.  But still things like accessing specific file paths
as might be needed in flow configurations themselves (GetFile/PutFile
for example) could be tricky (but doable).
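
As a contrived illustration (paths made up), a GetFile Input Directory
that resolves on the Linux and OS X nodes will not resolve on the
Windows node unless it is mapped or handled per node:

  Input Directory:  /data/inbound      <- fine on the Linux / OS X nodes
  Input Directory:  C:\data\inbound    <- what the Windows node would need

So it is doable, it just takes a little planning in the flow design.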

The protocols used to source data matter a lot
With all this talk of back pressure, keep in mind that how data gets
into NiFi becomes really critical in these clusters.  If you use
protocols which do not afford fault tolerance and load balancing,
things are not great.  Protocols which have queuing semantics,
feedback mechanisms, or which let NiFi as the consumer control the
pace will work out well.  Some portions of JMS are good for this.
Kafka is good for this.  NiFi's own site-to-site is good for this.

The frankencluster testing is a valuable way to force us to think
through interesting issues.  Maybe the frankencluster as you have it
isn't realistic, but it still exposes the concepts that need to be
thought through for cases that definitely are.

Thanks
Joe

On Tue, Sep 27, 2016 at 7:37 AM, Joe Skora <jsk...@gmail.com> wrote:
> The images just show what the text described, 13 files queued, EmptyQueue
> returns 0 of 13 removed, and ListQueue returns the queue has no flowfiles.
>
> There were 13 files of 1k sitting in a queue between a SegmentContent and
> ControlRate.  After I sent that email I had to stop/start the processors a
> couple of times for other things and somewhere in the midst of that the
> queue cleared.
>
>
>
> On Mon, Sep 26, 2016 at 11:05 PM, Peter Wicks (pwicks) <pwi...@micron.com>
> wrote:
>
>> Joe,
>>
>> I didn’t get the images (might just be my exchange server). How many files
>> are in the queue? (exact count please)
>>
>> --Peter
>>
>> From: Joe Skora [mailto:jsk...@gmail.com]
>> Sent: Monday, September 26, 2016 8:20 PM
>> To: dev@nifi.apache.org
>> Subject: Questions about heterogeneous cluster and queue
>> problem/bug/oddity in 1.0.0
>>
>> I have a 3 node test franken-cluster that I'm abusing for the sake of
>> learning.  The systems run Ubuntu 15.04, OS X 10.11.6, and Windows 10
>> and, though not identical, each has a quad-core i7 between 2.5 and
>> 3.5 GHz and 16GB of RAM.  Two have SSDs and the third has a 7200RPM
>> SATA III drive.
>>
>> 1) Is there any reason mixing operating systems within the cluster
>> would be a bad idea?  Once configured it seems to run ok.
>> 2) Will performance disparities affect reliability or performance
>> within the cluster?
>> 3) Are there ways to configure disparate systems such that they can all
>> perform at peak?
>>
>> The bug or issue I have run into is a queue showing files that can't
>> be removed or listed.  Screen shots attached below.  I don't know if
>> it's a mixed-OS issue, something I did while torturing the systems
>> (all stayed up, this time), or just a weird anomaly.
>>
>> Regards,
>> Joe
>>
>> Trying to empty queue seen in background
>> [Inline image 1]
>>
>> but the flowfiles cannot be deleted.
>> [Inline image 2]
>>
>> But try to list them and it says there are no files in the queue?
>> [Inline image 3]
>>
