[ https://issues.apache.org/jira/browse/NIFI-7510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127441#comment-17127441 ]

Mark Payne edited comment on NIFI-7510 at 6/6/20, 10:33 PM:
------------------------------------------------------------

[~garfield69au] this sounds like the correct behavior. When you split the data 
using SplitRecord, it adds a series of "fragment.*" attributes. One of those is 
"fragment.count", the total number of splits being generated. When you use 
MergeRecord in Defragment mode, it waits to receive however many splits the 
"fragment.count" attribute specifies. So this works fine without 
RouteOnAttribute. But when you add RouteOnAttribute, you filter out some of the 
FlowFiles, so not all of them arrive at MergeRecord. Thus, it keeps waiting to 
receive them all and never merges anything together.
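To illustrate (this is a hypothetical sketch, not NiFi source code), the Defragment decision boils down to comparing how many FlowFiles have arrived for a given "fragment.identifier" against that fragment's "fragment.count":

```python
# Hypothetical sketch of Defragment binning: a bin is complete only when
# the number of FlowFiles received for a fragment.identifier equals the
# fragment.count those FlowFiles carry.
from collections import defaultdict

bins = defaultdict(list)

def receive(flowfile):
    """Add a FlowFile to its bin; return the merged bin only when complete."""
    frag_id = flowfile["fragment.identifier"]
    bins[frag_id].append(flowfile)
    if len(bins[frag_id]) == int(flowfile["fragment.count"]):
        return bins.pop(frag_id)  # all fragments arrived: merge the bin
    return None  # still waiting on missing fragments

# SplitRecord announces 3 splits, but RouteOnAttribute drops one upstream:
splits = [{"fragment.identifier": "a", "fragment.count": "3",
           "fragment.index": str(i)} for i in range(3)]
filtered = splits[:2]  # one FlowFile was routed away
results = [receive(f) for f in filtered]
# every result is None: the bin never completes, so nothing is ever merged
```

If the third FlowFile did arrive, the bin would complete and merge; with it filtered out, the bin sits in the queue until Max Bin Age (if set) expires it to failure.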

That said, the notion of splitting the data apart and then merging it back 
together again, especially with Record-based processors, is something that you 
should avoid if at all possible. Doing that results in far more work for the 
NiFi engine and provides far worse performance (commonly an order of magnitude 
difference in performance) than when you keep records together in a single 
FlowFile.

So your flow could probably look something more like:

GetFile -> ConvertExcelToCSV -> UpdateRecord -> ExecuteSQLRecord -> PutFile

It's not entirely clear to me whether you need to treat records differently 
when (count = 0) or whether the intent is just to filter them out. But you can 
adapt the flow to handle that, if necessary, by adding "PartitionRecord -> 
RouteOnAttribute" between ExecuteSQLRecord and PutFile.
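One way that pairing might be configured (property names here are illustrative, and I'm assuming the query's row count lands in a record field named "count"):

```
PartitionRecord (one user-defined property; its value is a RecordPath,
                 and each partition gets a matching attribute):
    has_rows = /count

RouteOnAttribute (each user-defined property becomes a relationship):
    no_rows = ${has_rows:equals('0')}    -> exception handling / LogAttribute
    unmatched                            -> PutFile
```

PartitionRecord keeps records with the same "count" value together in one FlowFile, so the whole group can be routed without ever splitting to one record per FlowFile.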


was (Author: markap14):
[~garfield69au] this sounds like the correct behavior. When you split the data 
using SplitRecord, it adds a series of "fragment.*" attributes. One of those is 
"fragment.count". This is the total number of "splits" being generated. When 
you use MergeRecord with Defragment mode, it is waiting to receive however many 
splits are specified by the "fragment.count" attribute. So this works fine 
without RouteOnAttribute. But when you add in RouteOnAttribute, you are 
filtering out some of the FlowFiles, so all of the FlowFiles don't arrive at 
MergeRecord. Thus, it will continue waiting until it receives them all without 
merging anything together.

I do think we can improve MergeRecord by allowing for a "Bin Expiration 
Strategy" that is either "Merge Available FlowFiles" or "Fail Available 
FlowFiles." Right now, if the Max Bin Age is reached, the FlowFiles that belong 
together go to failure. If we added this new property, it would allow for the 
user to specify that the Processor should merge together whatever it can and 
just ignore the rest.

That said, the notion of splitting the data apart and then merging it back 
together again, especially with Record-based processors, is something that you 
should avoid if at all possible. Doing that results in far more work for the 
NiFi engine and provides far worse performance (commonly an order of magnitude 
difference in performance) than when you keep records together in a single 
FlowFile.

So your flow could probably look something more like:

GetFile -> ConvertExcelToCSV -> UpdateRecord -> ExecuteSQLRecord -> PutFile

It's not entirely clear to me if you need to explicitly treat records 
differently if (count = 0) or if the intent is just to filter them out. But you 
can adapt the flow to handle that, if necessary, by adding "PartitionRecord -> 
RouteOnAttribute" in between ExecuteSQLRecord and PutFile.

> MergeRecords not working after RouteOnAttribute
> -----------------------------------------------
>
>                 Key: NIFI-7510
>                 URL: https://issues.apache.org/jira/browse/NIFI-7510
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.11.4
>         Environment: MS Windows 10
>            Reporter: Shane Downey
>            Priority: Minor
>
> Here's the scenario:
> I need to process an arbitrary length spreadsheet or CSV file.
> For each row in the file, I need to extract a value from a column called 
> "Key" and pass it into a SQL Query. If the query returns 0 rows I need to 
> write the source "Key" into an exception file. 
> For the successful queries, I need to write the full output to a spreadsheet 
> file.
>  
> Setup:
> GetFile -> ConvertExcelToCSVProcessor -> SplitRecord (CSVReader/JSONRecordSetWriter) ->
> ExtractText ("Key"/Value) -> UpdateAttribute ("arg.1" = value) ->
> ExecuteSQLRecord -> RouteOnAttribute (rowcount = 0) -> LogAttribute
> RouteOnAttribute (rowcount != 0) -> MergeRecord (Defragment) -> PutFile
>  
> In the above scenario, the FlowFiles block in the queue leading into 
> MergeRecord (they never actually make it into MergeRecord).
>  
> If I change the merge strategy to Bin then data does flow through to the 
> output file in "chunks" of rows.
> If I remove RouteOnAttribute from the flow, the solution works as expected 
> and my output file is created as expected.
>  
> Thus - there appears to be an issue using RouteOnAttribute within a 
> SplitRecord/MergeRecord block and using the defragment merge strategy. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
