Github user markap14 commented on the issue:
https://github.com/apache/nifi/pull/2416
@devriesb that was the approach that I started with. However, to ensure
correctness, it turned out not to be a simple tweak to the existing
implementation. Initially, my thoughts were along the lines of:
Multiple Partitions
- Each FlowFile must be sticky to a partition.
- PartitionId = Hash<Record ID> % numPartitions.
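That stickiness rule can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are mine, not NiFi's) of mapping a record ID to a partition so that every update for the same FlowFile lands in the same partition:

```java
public class PartitionAssignment {

    // Hypothetical sketch: PartitionId = Hash<Record ID> % numPartitions.
    // Math.floorMod is used instead of % so a negative hash value cannot
    // produce a negative partition index.
    public static int partitionFor(final long recordId, final int numPartitions) {
        return Math.floorMod(Long.hashCode(recordId), numPartitions);
    }

    public static void main(final String[] args) {
        final int numPartitions = 4;
        // The same record ID always maps to the same partition (stickiness)
        System.out.println(partitionFor(42L, numPartitions) == partitionFor(42L, numPartitions));
    }
}
```

The key property is determinism: as long as numPartitions does not change, all updates for a given FlowFile are appended to the same partition file.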
But now consider the following scenario:
MergeContent merges 2 FlowFiles (A and B) to create a new FlowFile (C).
This requires updates to Partitions 1 and 2 (to transfer A and B to the
'original' relationship) and to Partition 3 (for the new FlowFile C).
If Partitions 1 & 2 are flushed to disk but Partition 3 is not, then on
restart we have transferred FlowFiles to the 'original' relationship without
the newly created FlowFile, so we have data loss. This is not OK.
To avoid this, we can consider that the transaction is not complete unless
all partitions in the transaction agree.
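That "all partitions must agree" recovery rule could be sketched roughly like this. Again, this is a hypothetical illustration, not NiFi's implementation: a multi-partition transaction is applied on restart only if every participating partition recorded it before the crash.

```java
import java.util.Map;
import java.util.Set;

public class TransactionRecovery {

    // Hypothetical sketch of the recovery rule: a transaction that spans
    // several partitions may be applied on restart only if every
    // participating partition flushed its piece of the transaction to disk.
    public static boolean isComplete(final long transactionId,
                                     final Set<Integer> participatingPartitions,
                                     final Map<Integer, Set<Long>> flushedTxnsByPartition) {
        for (final int partition : participatingPartitions) {
            final Set<Long> flushed = flushedTxnsByPartition.get(partition);
            if (flushed == null || !flushed.contains(transactionId)) {
                // One partition lost its piece, so the whole transaction
                // must be discarded on restart.
                return false;
            }
        }
        return true;
    }
}
```

In the merge scenario above: Partitions 1 and 2 flushed transaction C but Partition 3 did not, so the transaction is incomplete and must be thrown out as a unit.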
Now consider, though, that Partitions 1 & 2 are flushed to disk but
Partition 3 is not. Then FlowFile A is updated again, this time by itself,
and Partition 1 is flushed to disk while Partition 3 still is not.
Now on restart, we throw out the 'merge' transaction because Partition 3
was not flushed to disk, but we carry on with the subsequent transaction
for FlowFile A, effectively missing one of the updates to the FlowFile.
*same situation we are in now*.
To prevent this, we have to consider that since our 'merge' transaction
didn't complete (we encountered EOF), all subsequent transactions for the
FlowFiles involved in the 'merge' transaction have to be rolled back. This
still leaves us with the following possibility, though:
FlowFiles A and B were merged into C, and A and B were transferred to
'original'. This transaction failed as described above because Partition 3
was not flushed to disk.
Now, FlowFile A was forked into 5 children, and these 5 children were
written to Partitions 4 and 5.
Partitions 1, 4, and 5 are flushed to disk, but Partition 3 is not.
On restart, we see that Partition 3 was not flushed, so we discard the
'merge' transaction and the FlowFiles are placed back on the queue for
MergeContent. But now we have children of FlowFile A running through the
flow, and we will re-merge A with other data.
Additionally, in such a case, we have the overhead of not only writing to
potentially many files for a single update to the repository but also of
writing to a transaction log that is a single point through which all updates
must go, and this can amount to quite an expensive update. This amounts to
more than a tweak to the code, unfortunately.
Given the criticality of this branch of the code, correctness is definitely
more important than performance, but performance is still critical, and we
should strive to be both correct and performant.
Given that, I started looking at an alternative: serialize the data to
byte arrays outside of any lock contention, then perform only the write of
that data to a file within the contended portion of the code.
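The shape of that alternative can be sketched as below. This is a simplified, hypothetical illustration (class and method names are mine): the comparatively expensive serialization happens before the lock is acquired, so the contended section is reduced to the raw write.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.locks.ReentrantLock;

public class SerializeOutsideLock {

    private final ReentrantLock writeLock = new ReentrantLock();
    private final OutputStream out;

    public SerializeOutsideLock(final OutputStream out) {
        this.out = out;
    }

    // Hypothetical sketch: serialize the update to a byte array outside of
    // any lock contention, then hold the lock only long enough to append
    // the already-serialized bytes to the file.
    public void update(final long recordId, final String data) {
        // 1. Serialize outside of any lock contention
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(baos)) {
            dos.writeLong(recordId);
            dos.writeUTF(data);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        final byte[] serialized = baos.toByteArray();

        // 2. Only the write of the pre-serialized bytes happens within
        //    the contended portion of the code
        writeLock.lock();
        try {
            out.write(serialized);
            out.flush();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writeLock.unlock();
        }
    }
}
```

The point of the design is that the time spent holding the lock no longer scales with serialization cost, only with the size of the already-built byte array.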
The good news is that a lot of the serialization and error handling is
copied & pasted from the existing implementation and then modified to fit a
cleaner software design. So while it is a new implementation, there is still a
large amount of reuse.
---