Hi Martin,

Generally, a NiFi processor doesn't load the entire content of a file
into memory, so NiFi is capable of handling huge files.
However, having a massive number of FlowFiles can cause OOM issues,
because FlowFiles and their attributes reside on the heap.
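As a rough back-of-the-envelope illustration (the per-line size and
per-FlowFile overhead below are assumptions, not measured values): a
2.5GB file with ~100-byte lines yields roughly 25 million splits, and
even at only a few hundred bytes of heap per FlowFile (attribute map,
filename, fragment.* attributes added by SplitText), that comes to
several GB of heap, far beyond a 1GB JVM.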

I assume you are using a 'Line Split Count' of 1 in SplitText.
We recommend using multiple SplitText processors so that you don't
generate too many FlowFiles in a short period of time.
For example, the 1st SplitText splits the file into 5,000-line chunks,
then the 2nd SplitText splits each chunk into individual lines.
This way, we can decrease the number of FlowFiles present at any given
time, requiring less heap. A configuration sketch follows below.
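
A minimal sketch of the relevant properties (the 5,000-line chunk size
is just an example value; tune it for your data and heap size):

    1st SplitText (coarse split):
        Line Split Count: 5000

    2nd SplitText (per-line split):
        Line Split Count: 1

Connect the 'splits' relationship of the 1st SplitText to the 2nd
SplitText, and the 'splits' relationship of the 2nd SplitText to the
rest of your flow (RouteOnContent -> MergeContent -> PutHDFS), which
stays unchanged.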

I hope this helps.

Thanks,
Koji

On Wed, May 31, 2017 at 6:20 PM, Martin Eden <[email protected]> wrote:
> Hi all,
>
> I have a vanilla NiFi 1.2.0 node with 1GB of heap.
>
> The flow I am trying to run is:
> ListHDFS -> FetchHDFS -> SplitText -> RouteOnContent -> MergeContent ->
> PutHDFS
>
> When I give it a 300MB input zip file (2.5GB uncompressed) I am getting a
> Java OutOfMemoryError, as shown below.
>
> Does NiFi read the entire contents of files into memory? This is
> unexpected; I thought it chunked through files. Giving more RAM is not a
> solution, as you can always get larger input files in the future.
>
> Does this mean NiFi is not suitable as a scalable ETL solution?
>
> Can someone please explain what is happening and how to mitigate large
> files in NiFi? Any patterns?
>
> Thanks,
> M
>
> ERROR [Timer-Driven Process Thread-9]
> o.a.nifi.processors.standard.SplitText
> SplitText[id=e16939ca-f28f-1178-b66e-054e43a0a724]
> SplitText[id=e16939ca-f28f-1178-b66e-054e43a0a724] failed to process
> session due to java.lang.OutOfMemoryError: Java heap space: {}
>
> java.lang.OutOfMemoryError: Java heap space
>
>         at java.util.HashMap$EntrySet.iterator(HashMap.java:1013)
>
>         at java.util.HashMap.putMapEntries(HashMap.java:511)
>
>         at java.util.HashMap.<init>(HashMap.java:489)
>
>         at
> org.apache.nifi.controller.repository.StandardFlowFileRecord$Builder.initializeAttributes(StandardFlowFileRecord.java:219)
>
>         at
> org.apache.nifi.controller.repository.StandardFlowFileRecord$Builder.addAttributes(StandardFlowFileRecord.java:234)
>
>         at
> org.apache.nifi.controller.repository.StandardProcessSession.putAllAttributes(StandardProcessSession.java:1723)
>
>         at
> org.apache.nifi.processors.standard.SplitText.updateAttributes(SplitText.java:367)
>
>         at
> org.apache.nifi.processors.standard.SplitText.generateSplitFlowFiles(SplitText.java:320)
>
>         at
> org.apache.nifi.processors.standard.SplitText.onTrigger(SplitText.java:258)
>
>         at
> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
>
>         at
> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1118)
>
>         at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:144)
>
>         at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
>
>         at
> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
>         at java.lang.Thread.run(Thread.java:748)
