Richard,

So the order of the children may be important for some people. It certainly is reasonable to care about the order in which the children were created.

The larger concern, though, is that if we moved to a Set such as HashSet, the amount of heap consumed would be remarkably different. Since this collection is sometimes quite large, a Set could be problematic.
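For context, the removal cost being reported is easy to reproduce outside of NiFi. This is just a standalone sketch (not NiFi code; the size is made up and scaled down from the numbers below) showing why per-element List.remove() turns quadratic overall while a Set stays linear:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class RemoveCost {
    public static void main(String[] args) {
        int n = 10_000; // scaled down; the report below involved 100,000+
        List<String> list = new ArrayList<>();
        Set<String> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            String uuid = UUID.randomUUID().toString();
            list.add(uuid);
            set.add(uuid);
        }
        List<String> toRemove = new ArrayList<>(list);

        long t0 = System.nanoTime();
        for (String u : toRemove) {
            list.remove(u); // linear scan + shift on every call -> O(n^2) total
        }
        long listMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        for (String u : toRemove) {
            set.remove(u); // hash lookup on every call -> O(n) total
        }
        long setMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("List removal: " + listMs + " ms, Set removal: " + setMs + " ms");
    }
}
```

The flip side, as noted above, is the per-entry heap overhead of a HashSet, which is what makes that trade-off unattractive for a very large child collection.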

That said, with the approach that you are taking, I don't think you're going to get the result that you are looking for, because as you remove the FlowFiles, the events generated for them are also removed. So you won't end up getting any Provenance events anyway.

One possible way to achieve what you are looking for is to instead emit each of those FlowFiles individually and then use a MergeContent processor to merge the FlowFiles back together. Using this approach, though, you will certainly run into heap concerns if you are trying to merge 500,000 FlowFiles in a single iteration. Typically, the approach that we would follow is to merge, say, 10,000 FlowFiles at a time and then have a subsequent MergeContent that would merge together 50 of those 10,000-FlowFile bundles.
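To make the staging concrete, here is a rough sketch of the bundling arithmetic only (merging lists of record ids as a stand-in; in NiFi the actual merging would be done by two MergeContent processors, and the 10,000 and 50 figures are just the example numbers above):

```java
import java.util.ArrayList;
import java.util.List;

public class TwoStageMerge {
    // Merge consecutive inputs into bundles of at most bundleSize each,
    // standing in for what a MergeContent processor does to FlowFiles.
    static List<List<Integer>> mergeInBundles(List<List<Integer>> inputs, int bundleSize) {
        List<List<Integer>> bundles = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i += bundleSize) {
            List<Integer> merged = new ArrayList<>();
            for (List<Integer> part : inputs.subList(i, Math.min(i + bundleSize, inputs.size()))) {
                merged.addAll(part);
            }
            bundles.add(merged);
        }
        return bundles;
    }

    public static void main(String[] args) {
        // 500,000 single-record "FlowFiles"
        List<List<Integer>> records = new ArrayList<>();
        for (int i = 0; i < 500_000; i++) {
            records.add(List.of(i));
        }

        // First MergeContent: 10,000 FlowFiles at a time
        List<List<Integer>> firstStage = mergeInBundles(records, 10_000);
        // Second MergeContent: 50 of those 10,000-FlowFile bundles
        List<List<Integer>> secondStage = mergeInBundles(firstStage, 50);

        System.out.println(firstStage.size() + " bundles, then "
                + secondStage.size() + " merged file(s)");
        // -> 50 bundles, then 1 merged file(s)
    }
}
```

The point of the staging in NiFi is that no single MergeContent bin ever has to track 500,000 FlowFiles at once, which keeps the heap pressure bounded.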

Thanks
-Mark


> On Apr 15, 2016, at 11:57 AM, Richard Miskin <[email protected]> wrote:
> 
> Hi,
> 
> I’m trying to track down a performance problem that I’ve spotted with a 
> custom NiFi processor that I’ve written. When triggered by an incoming 
> FlowFile, the processor loads many (up to about 500,000) records from a 
> database and produces an output file in a custom format. I’m trying to 
> leverage NiFi provenance to track what has gone into the merged file, so the 
> processor creates individual FlowFiles for each database record parented from 
> the incoming FlowFile and with various attributes set. The output FlowFile is 
> then created as a merge of all the database record FlowFiles.
> 
> As I don’t require the individual database record FlowFiles outside the 
> processor I call session.remove(Collection<FlowFile>) rather than 
> transferring them. This works fine for small numbers of records, but the call 
> to remove gets very slow as the number of FlowFiles increases, taking over a 
> minute for 100,000 records.
> 
> I need to do some further testing to be sure of the cause, but looking through 
> the code I see that StandardProvenanceEventRecord.Builder contains a 
> List<String> to hold the child uuids. The call to session.remove() eventually 
> calls down to List.remove(), which will get progressively slower as the List 
> grows.
> 
> Given the entries in the List<String> are uuids, could this reasonably be 
> changed to be a Set<String>? Presumably there should never be duplicates, but 
> does the order of entries matter?
> 
> Regards,
> Richard
