Ravi,

We process large numbers of files and noticed the I/O overhead as well. I put 
the content, FlowFile, and provenance repositories on SSDs and was then able 
to achieve high enough throughput with our workloads to consistently saturate 
a 1 Gb pipe with heavy parallel processing, which was fast enough for our 
purposes.

I'd rather have the provenance history available for support and 
troubleshooting, so it's a trade-off that makes sense in our case, as SSDs are 
cheap enough now.

Rick



> On Sep 5, 2015, at 9:57 PM, ravi mannan <[email protected]> 
> wrote:
> 
> 
> Hello,
> I am new to Apache NiFi (and Java, somewhat), so forgive my ignorance. I 
> want to use NiFi to process large files; my university has some research 
> work that could be written as NiFi processors. I know "large" can be a 
> relative term, but for the resources I have, these files are large: about 
> 700 MB-1 GB. I *had* processors that took these large files and split them 
> into smaller files, then ran our algorithms (which I ported over to Java), 
> extracted some data, and finally did almost an "if-else" branching. Seems 
> great for NiFi, right? Unfortunately, I am seeing problems where NiFi 
> appears to be I/O bound. I think this is because of the write-ahead log and 
> the provenance log, and all the transactions recorded there after each 
> processor runs. I then essentially combined my processors into one large 
> one, which I know goes against the grain of NiFi.
> I wonder if I should use some of these:
> SideEffectFree
> SupportsBatching
> 
> and maybe I should've used SupportsBatching instead of throwing my code into 
> one big processor?
> Even if the above does help with the I/O problems, I still like having one 
> large NiFi processor that does a lot of work. One benefit of a single big 
> processor is that I can use the callback in session.read to load my large 
> files into a java.nio.ByteBuffer. A ByteBuffer is useful in my case because 
> I can keep my data off-heap and run my algorithms a chunk (or ".slice()") at 
> a time. Hopefully you are familiar with this class. If I had many small 
> processors (with the grain), I would have to constantly use session.read 
> with an InputStreamCallback and read the InputStream, which is not as 
> efficient as using a ByteBuffer.
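A minimal sketch of the off-heap, slice-per-chunk approach described above, in plain Java (no NiFi API involved; the chunk size and the byte-summing "algorithm" are placeholder assumptions, not anything from the thread):

```java
import java.nio.ByteBuffer;

public class ChunkedBufferDemo {
    static final int CHUNK_SIZE = 4;  // placeholder; tune to the real workload

    // Placeholder "algorithm": sum the unsigned byte values in one chunk.
    static long processChunk(ByteBuffer chunk) {
        long sum = 0;
        while (chunk.hasRemaining()) {
            sum += chunk.get() & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Direct buffer: backing memory lives off the Java heap.
        ByteBuffer data = ByteBuffer.allocateDirect(10);
        for (int i = 0; i < 10; i++) data.put((byte) i);
        data.flip();  // switch from writing to reading

        long total = 0;
        while (data.hasRemaining()) {
            int len = Math.min(CHUNK_SIZE, data.remaining());
            ByteBuffer chunk = data.slice();  // zero-copy view of the remainder
            chunk.limit(len);                 // restrict the view to one chunk
            total += processChunk(chunk);
            data.position(data.position() + len);  // advance past the chunk
        }
        System.out.println(total);  // 0+1+...+9 = 45
    }
}
```

Because `slice()` shares the underlying memory, each chunk is processed without copying the data back onto the heap.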
> 
> If I'm reading things correctly, FlowFileController.getContent returns an 
> InputStream, so that's not bad. My concern is that I will have many 
> processors (with many threads) reading InputStreams, leaving many objects 
> waiting to be garbage collected. 
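On the garbage-collection worry: one common way to keep allocation constant regardless of file size is to reuse a single byte[] per thread while streaming. A plain-Java sketch (the checksum is a placeholder for a real per-chunk algorithm, and the buffer sizes here are illustrative only):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamingRead {
    // Stream the content instead of buffering the whole file: one reusable
    // byte[] keeps allocation (and GC pressure) flat no matter how large the
    // input is. The unsigned-byte sum stands in for a real algorithm.
    static long checksum(InputStream in, byte[] buf) throws IOException {
        long sum = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) sum += buf[i] & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[] {1, 2, 3, 4, 5};
        byte[] reusable = new byte[2];  // tiny buffer, just to show chunked reads
        System.out.println(checksum(new ByteArrayInputStream(payload), reusable));  // 15
    }
}
```

The trade-off versus the ByteBuffer approach is that a streaming algorithm must work on sequential chunks and cannot seek backward, whereas a sliced buffer allows random access.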
> 
> 
> As you can imagine, I'm going off on tangents. I was wondering whether you 
> have any ideas to reduce the I/O I'm seeing, and what you think of my use of 
> ByteBuffers. Could I pass the ByteBuffer around in a FlowFile?
> Thanks,
> Ravi
