All,

Before the change to the schema-based repositories was committed, I was doing
some testing for NIFI-1847 Improve Provenance Space Utilization
<https://issues.apache.org/jira/browse/NIFI-1847> based on these
assumptions.

   - A partition {{nifi.provenance.repository.directory.XYZ}} entry is
   individually tracked only if there is a corresponding
   {{nifi.provenance.repository.directorySize.XYZ}} entry; otherwise it is
   only counted against the aggregate totals.
   - The original {{nifi.provenance.repository.max.storage.size}} property
   represents an aggregate across all partitions, whether specifically
   tracked or not.
   - Tracked partitions are evaluated first and their sizes accumulated
   toward the aggregate to avoid double work.
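
To make the three assumptions concrete, here is a rough Java sketch of the accounting. It is not code from NiFi or from the NIFI-1847 patch: the class name PartitionSpaceCheck, its fields, and the sizeOf callback are all hypothetical, and it assumes the aggregate counts the raw bytes used by every partition, tracked or not.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical illustration only; none of these names exist in NiFi.
final class PartitionSpaceCheck {
    // Tracked partitions that exceed their own directorySize cap -> excess bytes.
    final Map<String, Long> overLimit = new LinkedHashMap<>();
    // Bytes used across all partitions, tracked or not.
    long aggregateUsed;
    // Bytes by which aggregateUsed exceeds max.storage.size (0 if within limit).
    long aggregateExcess;

    static PartitionSpaceCheck evaluate(List<String> partitions,
                                        Map<String, Long> caps,
                                        long maxStorageSize,
                                        ToLongFunction<String> sizeOf) {
        PartitionSpaceCheck result = new PartitionSpaceCheck();

        // Pass 1: tracked partitions. Each is measured once, checked against
        // its own cap, and its size is carried into the aggregate.
        for (String name : partitions) {
            Long cap = caps.get(name);
            if (cap == null) {
                continue;
            }
            long used = sizeOf.applyAsLong(name);
            result.aggregateUsed += used;
            if (used > cap) {
                result.overLimit.put(name, used - cap);
            }
        }

        // Pass 2: untracked partitions only count against the aggregate.
        for (String name : partitions) {
            if (!caps.containsKey(name)) {
                result.aggregateUsed += sizeOf.applyAsLong(name);
            }
        }

        result.aggregateExcess = Math.max(0L, result.aggregateUsed - maxStorageSize);
        return result;
    }
}
```

The point of the two passes is that sizeOf, the expensive part, runs exactly once per partition.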


My testing showed improved use of space by partition, but also showed two
problems.

   - Calling the OS for the size of every journal, partition, and index
   file is expensive, so I'm looking at going to the OS only every Nth pass
   and tracking delta writes in between.
   - Writers are chosen round-robin, which is far from optimal when the
   size and available space vary by partition.  I have some thoughts but
   haven't put anything in code yet.
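
A rough Java sketch of both ideas, for discussion only. Everything here is hypothetical (CachedPartitionSize, WriterChooser, the osSize supplier, the refreshEvery parameter) and none of it is existing NiFi code. Picking the partition with the most free space is just one possible alternative to round robin, not necessarily the right one.

```java
import java.util.function.LongSupplier;

// Hypothetical: asks the OS for the real size only every Nth pass and
// adds the bytes written since the last real measurement in between.
final class CachedPartitionSize {
    private final LongSupplier osSize;   // the expensive call to the OS
    private final int refreshEvery;      // N: passes between real measurements
    private long lastMeasured;
    private long deltaSinceMeasure;
    private int passes;

    CachedPartitionSize(LongSupplier osSize, int refreshEvery) {
        if (refreshEvery < 1) {
            throw new IllegalArgumentException("refreshEvery must be >= 1");
        }
        this.osSize = osSize;
        this.refreshEvery = refreshEvery;
        this.lastMeasured = osSize.getAsLong();
    }

    // Called by a writer after it appends bytes to this partition.
    synchronized void recordWrite(long bytes) {
        deltaSinceMeasure += bytes;
    }

    // Called once per space-check pass.
    synchronized long estimate() {
        if (++passes >= refreshEvery) {
            lastMeasured = osSize.getAsLong();
            deltaSinceMeasure = 0;
            passes = 0;
        }
        return lastMeasured + deltaSinceMeasure;
    }
}

// Hypothetical: choose the partition with the most remaining space.
final class WriterChooser {
    static int mostFreeSpace(long[] capacity, long[] used) {
        int best = 0;
        for (int i = 1; i < capacity.length; i++) {
            if (capacity[i] - used[i] > capacity[best] - used[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

One caveat with the delta approach: a write that reaches disk just before a real measurement but is recorded just after it gets counted twice until the next refresh, so the estimate can run slightly high. Deletes and rollovers would also need to feed negative deltas, or the estimate only corrects itself on the Nth pass.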


Considering that provenance recording seems to be a bottleneck on some
flows, this needs to be as fast as possible while staying 100% reliable.
So any thoughts on these issues, or wisdom relating to repositories and
provenance, are appreciated.

Thanks,
Joe
