I'm bumping this hoping for some feedback before I dive back into the ticket.
Lacking any response for 30 days, I figure this either got overlooked due to year-end or no one has an opinion to add to the discussion (which seems unlikely). ;-)

On Tue, Dec 27, 2016 at 2:50 PM, Joe Skora <[email protected]> wrote:

> All,
>
> Before the change to the schema-based repositories was committed, I was doing
> some testing for NIFI-1847 Improve Provenance Space Utilization
> <https://issues.apache.org/jira/browse/NIFI-1847> based on these
> assumptions.
>
> - A partition {{nifi.provenance.repository.directory.XYZ}} entry would
>   only be individually tracked if there was a corresponding
>   {{nifi.provenance.repository.directorySize.XYZ}} entry;
>   otherwise it would only be counted against the aggregate totals.
> - The original {{nifi.provenance.repository.max.storage.size}}
>   property would represent an aggregate across all partitions, whether
>   specifically tracked or not.
> - Tracked partitions would be evaluated first and their sizes
>   accumulated to avoid double work.
>
> My testing showed improved use of space by partition, but also showed two
> problems.
>
> - Calling the OS for the size of every journal, partition, and index
>   file is expensive, so I'm looking at going to the OS every Nth pass and
>   tracking delta writes in between.
> - Writers are chosen round-robin, which is far from optimal
>   when the size and available space vary by partition. I have some thoughts
>   but haven't put anything in code yet.
>
> Considering that provenance recording seems to be a bottleneck on some
> flows, this needs to be as fast as possible while staying 100%
> reliable. So, any thoughts on these issues or wisdom relating to
> repositories and provenance are appreciated.
>
> Thanks,
> Joe
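
To make the "go to the OS every Nth pass and track delta writes in between" idea from the quoted message a bit more concrete, here is a rough sketch of what I have in mind. None of this is existing NiFi code: `PartitionSizeTracker`, `recordWrite`, `estimateSize`, and `refreshEvery` are hypothetical names, and the expensive OS lookup is passed in as a plain `LongSupplier` so the sketch stays self-contained.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: keeps a cached size for one provenance partition and
 * only asks the OS for the real size on every Nth pass. Between refreshes the
 * size is estimated as (last OS size + bytes written since then).
 */
public class PartitionSizeTracker {
    private final LongSupplier osSizeLookup; // expensive call, e.g. summing file sizes on disk
    private final int refreshEvery;          // N: number of passes between OS lookups
    private long baseSize;                   // size reported by the last OS lookup
    private long deltaWritten;               // bytes written since that lookup
    private int passesSinceRefresh;          // passes counted since that lookup

    public PartitionSizeTracker(LongSupplier osSizeLookup, int refreshEvery) {
        if (refreshEvery < 1) {
            throw new IllegalArgumentException("refreshEvery must be >= 1");
        }
        this.osSizeLookup = osSizeLookup;
        this.refreshEvery = refreshEvery;
        this.baseSize = osSizeLookup.getAsLong(); // one lookup up front
    }

    /** Called by a writer after it appends bytes to this partition. */
    public synchronized void recordWrite(long bytes) {
        deltaWritten += bytes;
    }

    /** Called once per size-check pass; hits the OS only on every Nth call. */
    public synchronized long estimateSize() {
        passesSinceRefresh++;
        if (passesSinceRefresh >= refreshEvery) {
            baseSize = osSizeLookup.getAsLong(); // resync with what is really on disk
            deltaWritten = 0;
            passesSinceRefresh = 0;
        }
        return baseSize + deltaWritten;
    }
}
```

The estimate only sees growth between refreshes, so anything that shrinks a partition (purges, rollovers) is not reflected until the next OS lookup. That is one reason N would probably need to stay fairly small, and it is part of what I would like feedback on.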
