Ravi,

We process large numbers of files and noticed the I/O overhead as well. I put the content, FlowFile, and provenance repositories on SSD and was then able to achieve high enough throughput with our workloads to consistently saturate a 1 Gb pipe with lots of parallel processing, which was fast enough for our purposes.
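As a sketch of the setup described above: the repository locations are configured in nifi.properties. The property names below are the standard ones, but the paths are hypothetical examples; point them at whatever SSD-backed mounts you have.

```properties
# Hypothetical example paths; adjust to your SSD mount points.
nifi.flowfile.repository.directory=/ssd/nifi/flowfile_repository
nifi.content.repository.directory.default=/ssd/nifi/content_repository
nifi.provenance.repository.directory.default=/ssd/nifi/provenance_repository
```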
I'd rather have the provenance history available for support and troubleshooting, so it's a trade-off that makes sense in our case, as SSD is cheap enough now.

Rick

> On Sep 5, 2015, at 9:57 PM, ravi mannan <[email protected]> wrote:
>
> Hello,
>
> I am new to Apache NiFi (and Java, somewhat), so forgive my ignorance. I want to use NiFi to process large files. My university has some research work that could be written as NiFi processors. I know "large" can be a relative term, but for the resources I have, these files are large: about 700 MB to 1 GB.
>
> I *had* processors that took these large files and split them into smaller files, then ran our algorithms (which I ported over to Java), then extracted some data, and finally did an almost "if-else" branching. Seems great for NiFi, right? Unfortunately, I am seeing problems where NiFi appears to be I/O bound. I think this is because of the write-ahead log, the provenance log, and all the transactions recorded there after each processor. I then essentially combined my processors into one large one, which I know goes against the grain of NiFi.
>
> I wonder if I should use some of these annotations:
>
> SideEffectFree
> SupportsBatching
>
> Maybe I should have used SupportsBatching instead of throwing my code into one big processor?
>
> Even if the above does help with the I/O problems, I still like having a large NiFi processor that does a lot of work. One benefit of a single big processor is that I can use the callback in session.read to load my large files into a java.nio.ByteBuffer. A ByteBuffer is useful in my case because I can keep my data off heap and run my algorithms a chunk (or ".slice()") at a time. Hopefully you are familiar with this class. If I had many small processors (with the grain), I would have to constantly use session.read with an InputStreamCallback and read the InputStream, which is not as efficient as using a ByteBuffer.
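The off-heap, chunk-at-a-time processing Ravi describes above might look like the minimal sketch below. It is plain Java with no NiFi APIs; the class and method names are hypothetical, and the "algorithm" is just a byte sum standing in for the real per-chunk work.

```java
import java.nio.ByteBuffer;

public class ChunkedBuffer {
    // Process a buffer in fixed-size slices, mimicking running an
    // algorithm one chunk at a time over off-heap data. Here the
    // "algorithm" is simply summing the unsigned byte values.
    static long sumInChunks(ByteBuffer data, int chunkSize) {
        long total = 0;
        ByteBuffer view = data.duplicate(); // don't disturb caller's position
        while (view.hasRemaining()) {
            int len = Math.min(chunkSize, view.remaining());
            ByteBuffer chunk = view.slice(); // zero-copy view of the remainder
            chunk.limit(len);                // restrict the view to one chunk
            while (chunk.hasRemaining()) {
                total += chunk.get() & 0xFF;
            }
            view.position(view.position() + len); // advance past this chunk
        }
        return total;
    }

    public static void main(String[] args) {
        // Direct (off-heap) allocation, as in Ravi's description.
        ByteBuffer buf = ByteBuffer.allocateDirect(10);
        for (int i = 0; i < 10; i++) buf.put((byte) i);
        buf.flip();
        System.out.println(sumInChunks(buf, 4)); // 0+1+...+9 = 45
    }
}
```

Note that slice() and duplicate() are zero-copy views over the same memory, which is what makes this chunking cheap compared with re-reading a stream.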
> If I'm reading things correctly, FlowFileController.getContent returns an InputStream, so that's not bad. My concern is that I will have many processors (with many threads) reading InputStreams, and then many objects waiting to be garbage collected.
>
> So, as you can imagine, I am going off on tangents. I was wondering if you have ideas to reduce the I/O I'm seeing, and what you think of my use of ByteBuffers. I wonder if I could pass the ByteBuffer around in a FlowFile?
>
> Thanks,
> Ravi
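The session.read / InputStreamCallback pattern mentioned in the thread ultimately comes down to draining an InputStream. A minimal sketch of copying a stream of known length into a direct (off-heap) ByteBuffer is below; it uses only standard java.nio, with a ByteArrayInputStream standing in for the stream NiFi would hand to the callback, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class StreamToBuffer {
    // Copy an InputStream of known length into an off-heap buffer.
    // Inside a NiFi InputStreamCallback the stream would come from
    // session.read; here a ByteArrayInputStream stands in for it.
    static ByteBuffer drainTo(InputStream in, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(length);
        try (ReadableByteChannel ch = Channels.newChannel(in)) {
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or EOF
            }
        }
        buf.flip(); // ready for the algorithm to consume
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "large file content".getBytes("UTF-8");
        ByteBuffer buf = drainTo(new ByteArrayInputStream(payload), payload.length);
        System.out.println(buf.remaining()); // 18 bytes available to read
    }
}
```

Wrapping the stream in a channel lets the JVM read directly into the off-heap buffer without an intermediate byte[] copy, which is the efficiency point Ravi is after.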
