All,
Before the change to the schema-based repositories was committed, I was doing
some testing for NIFI-1847 Improve Provenance Space Utilization
<https://issues.apache.org/jira/browse/NIFI-1847> based on these
assumptions:
- A partition {{nifi.provenance.repository.directory.XYZ}} entry would
only be individually tracked if there was a corresponding
{{nifi.provenance.repository.directorySize.XYZ}} entry; otherwise it would
only be counted against the aggregate totals.
- The original {{nifi.provenance.repository.max.storage.size}} property
would represent an aggregate across all partitions, whether specifically
tracked or not.
- Tracked partitions would be evaluated first and their sizes accumulated
to avoid duplicate work.
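
To make the three assumptions concrete, here is a rough Java sketch of the accounting. It is not NiFi code; the class ProvenanceSpaceCheck, the method evaluate, and the "*" aggregate key are all hypothetical names made up for illustration, and sizes are passed in as plain maps rather than read from the properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the accounting assumptions; not actual NiFi code. */
final class ProvenanceSpaceCheck {

    /** Made-up key under which the aggregate result is reported. */
    static final String AGGREGATE = "*";

    /**
     * @param usedBytes     bytes in use, keyed by partition name (the XYZ suffix)
     * @param limitBytes    per-partition limits; only partitions listed here are tracked individually
     * @param maxTotalBytes aggregate limit across all partitions, tracked or not
     * @return for each tracked partition and for AGGREGATE, whether it is over its limit
     */
    static Map<String, Boolean> evaluate(Map<String, Long> usedBytes,
                                         Map<String, Long> limitBytes,
                                         long maxTotalBytes) {
        Map<String, Boolean> overLimit = new LinkedHashMap<>();
        long total = 0L;

        // Tracked partitions go first; their sizes are accumulated here so
        // they do not have to be looked up again for the aggregate check.
        for (Map.Entry<String, Long> limit : limitBytes.entrySet()) {
            long used = usedBytes.getOrDefault(limit.getKey(), 0L);
            overLimit.put(limit.getKey(), used > limit.getValue());
            total += used;
        }

        // Untracked partitions only count against the aggregate total.
        for (Map.Entry<String, Long> used : usedBytes.entrySet()) {
            if (!limitBytes.containsKey(used.getKey())) {
                total += used.getValue();
            }
        }

        overLimit.put(AGGREGATE, total > maxTotalBytes);
        return overLimit;
    }
}
```

With partitions a (60 bytes used, limit 50) and b (30 bytes used, no limit) and an aggregate limit of 100, a would be flagged on its own while the aggregate would not.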
My testing showed improved use of space by partition, but it also showed two
problems:
- Calling the OS for the size of every journal, partition, and index
file is expensive, so I'm looking at going to the OS every Nth pass and
tracking delta writes in between.
- Writers are chosen round-robin, which is far from optimal when
the size and available space vary by partition. I have some thoughts but
haven't put anything in code yet.
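
For the first problem, the "every Nth pass plus deltas" idea could look roughly like the sketch below. This is only an illustration under my own assumptions: CachedPartitionSize, recordWrite, and estimate are hypothetical names, and the expensive OS lookup is abstracted as a LongSupplier so it can be whatever call is actually used.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch: asks the OS only every Nth pass, tracks written bytes in between. */
final class CachedPartitionSize {

    private final LongSupplier osSizeLookup;   // the expensive call to the OS
    private final int refreshEvery;            // N: passes between real measurements
    private final AtomicLong deltaBytes = new AtomicLong();
    private long lastMeasured;
    private int passesSinceRefresh;

    CachedPartitionSize(LongSupplier osSizeLookup, int refreshEvery) {
        if (refreshEvery < 1) {
            throw new IllegalArgumentException("refreshEvery must be >= 1");
        }
        this.osSizeLookup = osSizeLookup;
        this.refreshEvery = refreshEvery;
        this.lastMeasured = osSizeLookup.getAsLong();
    }

    /** Called by a writer after it appends bytes to this partition. */
    void recordWrite(long bytes) {
        deltaBytes.addAndGet(bytes);
    }

    /** Called once per size-check pass; returns the current size estimate. */
    synchronized long estimate() {
        passesSinceRefresh++;
        if (passesSinceRefresh >= refreshEvery) {
            // Reset the delta before measuring: a write that lands during the
            // lookup may be counted twice, which errs on the side of
            // overestimating usage rather than underestimating it.
            deltaBytes.set(0L);
            lastMeasured = osSizeLookup.getAsLong();
            passesSinceRefresh = 0;
        }
        return lastMeasured + deltaBytes.get();
    }
}
```

The estimate drifts between refreshes (deletes and rollovers are not reflected in this sketch), so N trades accuracy against the cost of the OS calls.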
Considering that provenance recording seems to be a bottleneck on some
flows, this needs to be as fast as possible while staying 100%
reliable. So, any thoughts on these issues or wisdom relating to
repositories and provenance are appreciated.
Thanks,
Joe