Hi,

I’m trying to track down a performance problem that I’ve spotted with a custom 
NiFi processor that I’ve written. When triggered by an incoming FlowFile, the 
processor loads many (up to about 500,000) records from a database and produces 
an output file in a custom format. I’m trying to leverage NiFi provenance to 
track what has gone into the merged file, so the processor creates individual 
FlowFiles for each database record parented from the incoming FlowFile and with 
various attributes set. The output FlowFile is then created as a merge of all 
the database record FlowFiles.

As I don’t require the individual database record FlowFiles outside the 
processor I call session.remove(Collection<FlowFile>) rather than transferring 
them. This works fine for small numbers of records, but the call to remove gets 
very slow as the number of FlowFiles increases, taking over a minute for 
100,000 records.

I need to do some further testing to be sure of the cause, but looking through the 
code I see that StandardProvenanceEventRecord.Builder contains a List<String> 
to hold the child uuids. The call to session.remove() eventually calls down to 
List.remove(), which will get progressively slower as the List grows.
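
To illustrate the suspected cost, here is a minimal standalone sketch in plain Java. It does not use any NiFi classes; the class name ChildUuidRemoval and its helper methods are hypothetical and only mimic the pattern of removing uuids one at a time from a collection. The timings it prints will vary by machine, so treat it as a rough comparison and not as a measurement of NiFi itself.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.UUID;

// Hypothetical stand-in for a builder that tracks child uuids.
class ChildUuidRemoval {

    // Generate n random uuid strings, similar in shape to FlowFile uuids.
    static List<String> makeUuids(int n) {
        List<String> ids = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            ids.add(UUID.randomUUID().toString());
        }
        return ids;
    }

    // Remove each id individually and return how many were actually removed.
    // On an ArrayList each remove(Object) is linear in the list size, so the
    // loop as a whole is quadratic. On a HashSet each remove is expected
    // constant time.
    static int removeAll(Collection<String> children, List<String> toRemove) {
        int removed = 0;
        for (String id : toRemove) {
            if (children.remove(id)) {
                removed++;
            }
        }
        return removed;
    }

    // Time one full removal pass, in milliseconds.
    static long timeMillis(Collection<String> children, List<String> toRemove) {
        long start = System.nanoTime();
        removeAll(children, toRemove);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<String> ids = makeUuids(50_000);
        System.out.println("ArrayList: " + timeMillis(new ArrayList<>(ids), ids) + " ms");
        System.out.println("HashSet:   " + timeMillis(new HashSet<>(ids), ids) + " ms");
    }
}
```

If the list-backed pass grows roughly fourfold each time the record count doubles, that would be consistent with the slowdown I am seeing.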

Given the entries in the List<String> are uuids, could this reasonably be 
changed to be a Set<String>? Presumably there should never be duplicates, but 
does the order of entries matter?
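
If the order does turn out to matter, a LinkedHashSet might be one option, since it keeps insertion order while still giving constant-time removal. The sketch below is plain Java with a hypothetical class name (OrderedChildUuids); it is not the NiFi code, and I am assuming the builder only ever needs to add, remove and iterate over the uuids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of an order-preserving set of child uuids.
class OrderedChildUuids {

    // Add every id, then remove the requested ones, and return what is left
    // in iteration order.
    static List<String> addAndRemove(List<String> toAdd, List<String> toRemove) {
        Set<String> children = new LinkedHashSet<>();
        for (String id : toAdd) {
            children.add(id);      // a repeated id is silently ignored
        }
        for (String id : toRemove) {
            children.remove(id);   // expected constant time per call
        }
        return new ArrayList<>(children);
    }

    public static void main(String[] args) {
        List<String> remaining = addAndRemove(
                Arrays.asList("a", "b", "c", "b", "d"),
                Arrays.asList("b"));
        System.out.println(remaining); // prints [a, c, d]
    }
}
```

One behavioural difference to be aware of: with a List, a duplicated uuid would be stored twice and remove() would drop only the first copy, whereas a Set stores it once. That should only matter if duplicates can actually occur.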

Regards,
Richard
