Ravi,

Tangents are fine.  We're here to help and there are a ton of ways to
tackle these sorts of things.

So here are some general thoughts that help with processing data at high rate:
1) Consider how you have NiFi laid out on the physical system
specifically with regard to disk usage.  You can, for example, have
the flowfile repo, content repo, and prov repo on their own physical
disks.  You can even have multiple content and prov repos to spread
the load even further as needed.  So truly you should expect hundreds
of MB per second of read/write rate.  On even a slow, single disk, all
repos on the same disk config you should see 50+ MB/s.
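For reference, the repository locations live in conf/nifi.properties.
Something like the following spreads them across disks (the paths here
are just examples, adjust for your mounts), and if you want multiple
content repos you can add more nifi.content.repository.directory.<name>
entries with distinct names:

```
nifi.flowfile.repository.directory=/disk1/flowfile_repository
nifi.content.repository.directory.default=/disk2/content_repository
nifi.provenance.repository.directory.default=/disk3/provenance_repository
```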

2) Consider the effect on the garbage collector.  Loading lots of
really large objects onto the heap creates pressure and can lead to
more frequent and costly full garbage collections - of course this
isn't always true, as it's always about the use case.  I'd avoid being
too eager to go with direct allocation of ByteBuffers unless it is for
long-lived, reference-type data.  If it is just for processing a large
object which you are already able to break into chunks, the
Input/Output stream concept NiFi exposes is ideal.  In such a case you
shouldn't even have to split up the large objects, because with
streaming the amount you'll ever hold in memory is bounded by your
buffer size rather than by the size of the object.  In short, there
are often very efficient ways to process data.
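To make that concrete in plain Java (this is a sketch of the streaming
idea, not NiFi's actual API - inside a session.read callback you get an
InputStream and can do exactly this; the processChunked helper name is
my own):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedRead {
    // Walk a stream with a fixed-size buffer.  Memory use is bounded by
    // the buffer size, not by the size of the underlying content, so a
    // 1 GB FlowFile costs no more heap than a tiny one.
    static long processChunked(InputStream in, int chunkSize) throws IOException {
        byte[] buffer = new byte[chunkSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            // run your algorithm against buffer[0..read) here
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1_000_000]; // stands in for a large FlowFile's content
        long seen = processChunked(new ByteArrayInputStream(data), 8192);
        System.out.println(seen); // prints 1000000
    }
}
```

The point is that you never materialize the whole object, so heap
pressure stays flat no matter how big the file is.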

3) The WAL for FlowFiles and prov repo should not become an issue
unless you have completely saturated the disks they run on or unless
the transaction rate is so high they simply cannot keep up.

4) Whether you do one monolith processor or whether you break them up
does certainly impact performance but mostly it is about reuse and
ease of troubleshooting.  This concept (how you compose processors) is
something that will evolve over time as you learn more and find
valuable reuse and so on.  You should not worry too much about it yet.
The most effective abstractions are often revealed - not invented.

Troubleshooting:
- How are you measuring the IO?  Are you doing this in Linux using
something like 'iostat'?
- Are you sure the issue you're seeing is truly IO and not garbage
collection?  Watching GC behavior with jstat can be extremely helpful.
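For those two checks, on Linux something like this is a reasonable
starting point (substitute your actual NiFi JVM pid):

```shell
# Extended per-device stats in MB, refreshed every 5 seconds
iostat -xm 5

# Heap occupancy and GC activity for the NiFi JVM, sampled every second
jstat -gcutil <nifi-pid> 1000
```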

Which version of NiFi are you on?  In the latest snapshot/codebase,
which is NiFi 0.3.0-SNAPSHOT, a massive performance change has
occurred which could potentially be a factor for you.  It wasn't
actually an IO loading issue but rather an implementation problem
which caused NiFi to slow down artificially, much like 'stop the
world' pauses in the JVM.  It has been remedied and I've personally
seen 4x or more improvements for certain use cases.

So let's iterate on this for a while and see if we can't get you where
you want to be performance-wise.  Let's start with understanding how
you're observing high IO and what the GC behavior is.

Thanks
Joe

On Sat, Sep 5, 2015 at 10:16 PM, ravi mannan
<[email protected]> wrote:
>
> Hello,
> I am new to apache nifi (and java, somewhat), so forgive my ignorance.  I 
> want to use nifi to process large files. My university has some research work 
> that could be written as nifi processors.  I know "large" can be a relative 
> term, but for the resources I have, these files are large. About 700mb-1gb.  
> I *had* processors that took these large files and split them into smaller 
> files, and then performed our algorithms (which I ported over to java), and 
> then extracted some data and finally did almost an "if-else" branching. Seems 
> great for nifi, right? Unfortunately I am seeing problems where it seems that 
> nifi is i/o bound. I think this is because the walog and provenance log and 
> all the transactions that are recorded there after each processor. I then 
> essentially combined my processes into a large one, which I know goes against 
> the grain of nifi.
> I wonder if I should use some of these:
> SideEffectFree
> SupportsBatching
>
> and maybe I should've used SupportsBatching instead of throwing my code into 
> one big processor?
> Even if the above does help with the i/o problems, I still like having a 
> large nifi processor that does a lot of work. One benefit to using one big 
> processor is that I can use the Callback in session.read to load my large 
> files into a java.nio.ByteBuffer. A ByteBuffer is useful in my case because I 
> can have my data off heap and run my algorithms a chunk (or ".slice()") at a 
> time.  Hopefully you are familiar with this class. If I had many small 
> processors (with the grain), I would have to constantly use session.read and 
> InputStreamCallback and read the InputStream which is not as efficient as 
> using a ByteBuffer.
>
> If I'm reading things correctly, FlowFileController.getContent returns an 
> InputStream, so that's not bad. My concern is that I will have many 
> processors (with many threads) reading InputStreams and then having many 
> objects waiting to be garbage collected.
>
>
> So as you can imagine I am going off into tangents, I was wondering if you 
> have ideas to reduce the i/o I'm seeing and what you think of my use of 
> ByteBuffers. I wonder if I could pass the ByteBuffer around in a FlowFile?
> Thanks,
> Ravi
