[ 
https://issues.apache.org/jira/browse/NIFI-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184899#comment-15184899
 ] 

Pierre Villard commented on NIFI-1438:
--------------------------------------

OK so I think I figured out what is going on. I believe that there is a 
bottleneck at the RouteOnContent processor. The input rate to your 
MergeProcessor is not high enough to meet your requirements (5000 flow files 
in less than 10 seconds).

With the flow I reproduced (see attachment), the RouteOnContent processor 
ingests about 641 flow files per second (on my computer). So the processor 
outputs about 64 flow files (with "xyz") per second, and you will end up with 
files of about 650 lines if you merge every 10 seconds.
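
As a rough check of the numbers above, here is a minimal sketch of the 
arithmetic. It assumes a steady ingest rate, one line per flow file, and a 
match fraction of about 10% (the figure given in the issue description); the 
function name is only illustrative.

```python
def lines_per_bin(ingest_rate, match_fraction, bin_age_s):
    """Expected number of matching flow files collected before the bin ages out."""
    return ingest_rate * match_fraction * bin_age_s


# Measured on my computer: about 641 flow files per second into RouteOnContent.
# Assumed: roughly 10% of them carry the "xyz" tag.
expected = lines_per_bin(ingest_rate=641, match_fraction=0.10, bin_age_s=10)

print(round(expected))    # roughly 640 lines per merged file
print(expected < 5000)    # the 5000-entry threshold is not reached in 10 seconds
```

So with a 10 second window the merged files stay far below 5000 lines, which 
is consistent with what I observe.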

By modifying the RouteOnContent processor and introducing a "batch size" 
property, so that the processor works on X flow files at each onTrigger call 
(instead of 1), I get 1390 flow files ingested per second with the batch size 
set to 1000.
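
To illustrate the batching idea only, here is a toy sketch in plain Python. 
It is not the NiFi API and not the actual patch: the queue, the on_trigger 
function and the route callback are hypothetical stand-ins for the processor's 
input queue, its onTrigger call and the routing step.

```python
from collections import deque


def on_trigger(queue, batch_size, route):
    """Handle up to batch_size queued items in a single call (instead of 1)."""
    processed = 0
    while queue and processed < batch_size:
        route(queue.popleft())
        processed += 1
    return processed


# Hypothetical input: 2500 one-line JSON items, 1 in 10 tagged "xyz".
queue = deque(
    '{"tags": ["xyz"]}' if i % 10 == 0 else '{"tags": []}'
    for i in range(2500)
)

matched = []


def route(item):
    # Stand-in for the content match: keep only items carrying the tag.
    if '"xyz"' in item:
        matched.append(item)


calls = 0
while queue:
    on_trigger(queue, batch_size=1000, route=route)
    calls += 1

print(calls, len(matched))
```

With a batch size of 1 the same input would need 2500 calls, so the per-call 
overhead is paid far less often when batching.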

If I change the MergeProcessor to merge data every 60 seconds (if the 5000 flow 
files have not been reached), I correctly output files with 5000 lines each.
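
A small sketch of why the 60 second window is enough, assuming the batched 
rate of 1390 flow files per second applies, that about 10% of them match, and 
that the rate is steady. The function name and the "count"/"age" labels are 
only illustrative, not NiFi terms.

```python
def bin_closes_on(matched_rate, min_entries, max_age_s):
    """Return which limit closes the bin first and how many lines it then holds."""
    time_to_fill = min_entries / matched_rate
    if time_to_fill <= max_age_s:
        return "count", min_entries
    return "age", int(matched_rate * max_age_s)


# Assumed matched rates: 10% of 1390 per second (batched) and 10% of 641 (unbatched).
print(bin_closes_on(matched_rate=139, min_entries=5000, max_age_s=60))
print(bin_closes_on(matched_rate=64, min_entries=5000, max_age_s=10))
```

Under these assumptions the first case fills 5000 entries in about 36 seconds, 
before the 60 second limit, while the second case ages out after 10 seconds 
with about 640 lines.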

If the batch size thing is of any interest, I can submit a PR. And maybe some 
other optimizations could be done on this processor.

Regarding "There are also a number of files with a "uuid.json" style name, all 
containing one line.", I don't know, but it could be because the output file 
already exists and MergeProcessor uses the uuid as the new file name. TBC.

> Unexpected results using MergeProcessor 
> ----------------------------------------
>
>                 Key: NIFI-1438
>                 URL: https://issues.apache.org/jira/browse/NIFI-1438
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 0.4.1
>         Environment: OSX 10.10.5, Java 8u45
>            Reporter: Josh Harrison
>         Attachments: nifi-merge-problem.xml, nifi-problem.tgz
>
>
> Hello, I'm opening a ticket in reference to the stack overflow question I had 
> at 
> http://stackoverflow.com/questions/34958347/mergecontent-with-nifi-inconsistent-length
>  
> To summarize, despite Aldrin's help, I have been unable to get the expected 
> merge behavior out of a template like the one attached, ingesting data like 
> the attached sample. 
> The goal is to ingest all of the zips in /tmp/nifidemo/source, extract the 
> zip files contained therein, each line being a json object. With json 
> routing, I extract and route for further processing ONLY items where the 
> "tags" item contains the tag "xyz".
> These routed files should be aggregated by "mergeContent" into a bucket with, 
> at minimum, 1000 lines – or after being starved for 30 seconds, whatever 
> occurs first.
> The behavior observed in my real template is replicated in this example – 
> merge content appears to be routing to buckets based on the original file 
> name, and not aggregating 1000 lines at a time as expected. Within a few 
> seconds of the template being run, many files are written with unexpected 
> line counts.
> More confusingly, this isn't a consistent pattern - files may be run 
> repeatedly and do not generate the same number of lines in the result each 
> time.
> The content of the input files was randomly generated so that approximately 
> 10% of the objects would contain the tag "xyz" (5000 lines in each input 
> file, there should be approximately 500 lines of – there are result files 
> that contain over 400 lines, but many contain 15-30 lines. There are also a 
> number of files with a "uuid.json" style name, all containing one line. 
> The attached contains a generic template that replicates the problem – it 
> seems to throw some errors but they don't appear to be related to the problem 
> I'm working on (and my real template doesn't throw the failures, but still 
> exhibits the same behavior).
> I am running NiFi 0.4.1 on a Mac OSX 10.10.5 system and JRE 8u45.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
