Ravi,

Tangents are fine. We're here to help, and there are a ton of ways to tackle these sorts of things.
So here are some general thoughts that help with processing data at a high rate:

1) Consider how you have NiFi laid out on the physical system, specifically with regard to disk usage. You can, for example, put the flowfile repo, content repo, and provenance repo on their own physical disks. You can even have multiple content and provenance repos to spread the load further as needed. Configured that way you should truly expect hundreds of MB per second of read/write rate. Even on a slow, single-disk, all-repos-on-the-same-disk config you should see 50+ MB/s.

2) Consider the effect on the garbage collector. Loading lots of really large objects onto the heap creates pressure and can lead to more frequent and costly full garbage collections - though this isn't always true, as it's always about the use case. I'd avoid being too eager to go with direct allocation of ByteBuffers unless it is for long-lived, reference-type data. If it is just for processing a large object which you are already able to break into chunks, the Input/OutputStream concept NiFi exposes is ideal. In such a case you shouldn't even have to split up the large objects, because the amount you'll ever have in memory should be no larger than the largest object contained within that set. In short, there are often very efficient ways to process data.

3) The write-ahead log for flowfiles and the provenance repo should not become an issue unless you have completely saturated the disks they run on, or unless the transaction rate is so high they simply cannot keep up.

4) Whether you write one monolithic processor or break the work up does certainly impact performance, but mostly it is a question of reuse and ease of troubleshooting. This concept (how you compose processors) is something that will evolve over time as you learn more and find valuable reuse and so on. You should not worry too much about it yet. The most effective abstractions are often revealed, not invented.

Troubleshooting:
- How are you measuring the IO?
  Are you doing this on Linux using something like 'iostat'?
- Are you sure the issue you're seeing is truly IO and not garbage collection? Watching GC behavior with 'jstat' can be extremely helpful.
- Which version of NiFi are you on? In the latest snapshot/codebase, NiFi 0.3.0-SNAPSHOT, a massive performance change has landed which could potentially be a factor for you. It wasn't actually an IO loading issue but rather an implementation problem which caused NiFi to slow down artificially, much like 'stop the world' pauses in the JVM. It has been remedied, and I've personally seen 4x or more improvement for certain use cases.

So let's iterate on this for a while and see if we can't get you where you want to be performance-wise. Let's start with understanding how you're observing high IO and what the GC behavior is.

Thanks
Joe

On Sat, Sep 5, 2015 at 10:16 PM, ravi mannan <[email protected]> wrote:
>
> Hello,
> I am new to apache nifi (and java, somewhat), so forgive my ignorance. I
> want to use nifi to process large files. My university has some research work
> that could be written as nifi processors. I know "large" can be a relative
> term, but for the resources I have, these files are large. About 700mb-1gb.
> I *had* processors that took these large files and split them into smaller
> files, and then performed our algorithms (which I ported over to java), and
> then extracted some data and finally did almost an "if-else" branching. Seems
> great for nifi, right? Unfortunately I am seeing problems where it seems that
> nifi is i/o bound. I think this is because the walog and provenance log and
> all the transactions that are recorded there after each processor. I then
> essentially combined my processes into a large one, which I know goes against
> the grain of nifi.
> I wonder if I should use some of these:
> SideEffectFree
> SupportsBatching
>
> and maybe I should've used SupportsBatching instead of throwing my code into
> one big processor?
> Even if the above does help with the i/o problems, I still like having a
> large nifi processor that does a lot of work. One benefit to using one big
> processor is that I can use the Callback in session.read to load my large
> files into a java.nio.ByteBuffer. A ByteBuffer is useful in my case because I
> can have my data off heap and run my algorithms a chunk (or ".slice()") at a
> time. Hopefully you are familiar with this class. If I had many small
> processors (with the grain), I would have to constantly use session.read and
> InputStreamCallback and read the InputStream which is not as efficient as
> using a ByteBuffer.
>
> If I'm reading things correctly, FlowFileController.getContent returns an
> InputStream, so that's not bad. My concern is that I will have many
> processors (with many threads) reading InputStreams and then having many
> objects waiting to be garbage collected.
>
>
> So as you can imagine I am going off into tangents, I was wondering if you
> have ideas to reduce the i/o I'm seeing and what you think of my use of
> ByteBuffers. I wonder if I could pass the ByteBuffer around in a FlowFile?
> Thanks,
> Ravi
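P.S. To make the stream-oriented idea from point (2) above concrete: processing a large stream a fixed-size chunk at a time keeps memory bounded by the buffer size, no matter how big the flowfile content is. Here's a minimal plain-Java sketch of the pattern (no NiFi dependency - in a real processor the InputStream would come from session.read with an InputStreamCallback; the class name and buffer size are just illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedStreamDemo {
    // Fixed-size buffer: peak memory use is bounded by BUF_SIZE,
    // regardless of how large the underlying content is.
    static final int BUF_SIZE = 8 * 1024;

    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[BUF_SIZE];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;  // replace with the real per-chunk processing
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large flowfile's content stream.
        byte[] data = new byte[3 * BUF_SIZE + 100];
        System.out.println(countBytes(new ByteArrayInputStream(data))); // prints 24676
    }
}
```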

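For what it's worth, the ByteBuffer .slice() chunk-at-a-time approach Ravi describes can be sketched like this in plain Java (again illustrative, not NiFi API - slice() gives a zero-copy view into the same backing storage, which works for heap and direct buffers alike):

```java
import java.nio.ByteBuffer;

public class SliceDemo {
    // Walks a buffer in fixed-size views without copying the backing data.
    static int countChunks(ByteBuffer buf, int chunkSize) {
        int chunks = 0;
        while (buf.hasRemaining()) {
            int len = Math.min(chunkSize, buf.remaining());
            ByteBuffer chunk = buf.slice();      // zero-copy view at the current position
            chunk.limit(len);                    // restrict the view to one chunk
            // ... run the algorithm on 'chunk' here ...
            buf.position(buf.position() + len);  // advance past this chunk
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(1_000_000); // off-heap storage
        System.out.println(countChunks(direct, 64 * 1024)); // prints 16
    }
}
```

Note the trade-off from point (2) still applies: direct buffers dodge GC pressure but their allocation is expensive, so this pays off mainly for long-lived buffers rather than per-flowfile throwaways.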