[ 
https://issues.apache.org/jira/browse/NIFI-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071481#comment-16071481
 ] 

Joseph Witt commented on NIFI-4146:
-----------------------------------

[~randerzander] as with SplitText, you cannot safely go from a single 150K-line 
input straight to one-line results.  You need to do a two-phase split: the first 
SplitRecord splits into batches of, say, 1000 or 500 records, and the second 
splits each batch into single records.
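
As a rough illustration of the two-phase idea outside NiFi, here is a minimal 
sketch in plain Python.  It is not NiFi code: `split_batches`, `two_phase_split` 
and the default batch size of 1000 are hypothetical names and values chosen only 
for this example.

```python
from typing import Iterator, List, Sequence


def split_batches(records: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def two_phase_split(records: Sequence[str],
                    first_phase_size: int = 1000) -> Iterator[List[str]]:
    """Phase 1 makes medium batches; phase 2 splits each batch into single records."""
    for batch in split_batches(records, first_phase_size):
        # Only one medium batch is expanded into single records at a time.
        for single in split_batches(batch, 1):
            yield single
```

The point of the sketch is that each stage only ever fans out by a bounded 
factor, rather than producing every single-record result in one step.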

Why?  Because going from a single bundle of 150K records to 150K individual 
records means you have roughly 150K flowfiles (the original plus one per record; 
metadata/references, not content) in memory at once, and that can eat up a lot 
of heap.  By doing the two-phase split you would never have more than 1001 in 
memory at a time, for example.
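
The arithmetic behind those numbers can be sketched as follows.  This is a 
simplified model, assuming each split step handles one incoming flowfile per 
session and holds that flowfile plus all of its outputs until the session 
commits; the function names are hypothetical and not part of NiFi.

```python
import math


def peak_flowfiles_single_phase(total_records: int) -> int:
    # One input flowfile plus one output flowfile per record, all in one session.
    return total_records + 1


def peak_flowfiles_two_phase(total_records: int, batch_size: int) -> int:
    # Phase 1: one input flowfile plus ceil(total / batch_size) batch flowfiles.
    phase_one = 1 + math.ceil(total_records / batch_size)
    # Phase 2: one batch flowfile plus at most batch_size single-record flowfiles.
    phase_two = 1 + min(batch_size, total_records)
    # The heavier of the two sessions bounds the in-memory reference count.
    return max(phase_one, phase_two)
```

Under these assumptions, 150,000 records give 150,001 flowfiles for a direct 
split, versus a peak of 1001 with a first-phase batch size of 1000 (or 501 with 
a batch size of 500).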

We do need to improve this by flushing in-flight sessions with lots of flowfile 
references to disk but we're not there yet.  The suggested approach works well, 
benefits from backpressure and parallel processing, and will get you on track.

> SplitRecord does not gracefully convert medium sized CSV into individual 
> FlowFiles
> ----------------------------------------------------------------------------------
>
>                 Key: NIFI-4146
>                 URL: https://issues.apache.org/jira/browse/NIFI-4146
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Randy Gelhausen
>         Attachments: flow.xml.gz, nifi-app.log, ubuntu.nifi-app.log
>
>
> SplitRecord fails to split a ~150k-line (57 MB) CSV file into individual 
> FlowFiles.
> This could be a configuration issue, but with a build from master today, I run 
> into problems out of the box on macOS and Linux: 
> On macOS Sierra, I get a too many open files error (See attached 
> nifi-app.log). On Ubuntu 17.04, I get OOMs (See attached ubuntu.nifi-app.log) 
> and the Web UI fails.
> The CSV file I'm using is available 
> [here|https://opendata.arcgis.com/datasets/229220ee14c147659e1049bd517c0b78_16.csv]
>  and I've attached the flow: [^flow.xml.gz].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
