Hey Joe,

Sorry - I don't think I saw this. I have actually been working on NIFI-3356 [1], for which I hope to have a PR up in the next few days. I've been doing some long-running tests, and I did find an issue yesterday, so I've redeployed to some nodes to let it run over the weekend. If all looks good I can perhaps have a PR in on Monday.

The Persistent Provenance Repository is quite old. At the time that it was written, the requirements were simply to store data in a sequential fashion and make it available for a Reporting Task to iterate over the events sequentially. There was no compression, and there was no indexing/searching. The requirements clearly have changed over the years :)

So I started working on a totally new implementation, and my testing shows that it is 2-3 times faster than the Persistent Provenance Repository while at the same time providing faster query capabilities and immediate access to events (as opposed to after a 30-second rollover period). When I get a chance to get it posted, it would be great if you want to put it through the wringer as well.

I say all of this because, if you are interested, it may be worth holding off a few days and looking into implementing something similar to the new repo instead of focusing on the PersistentProvenanceRepository (or updating both).

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-3356

On Jan 27, 2017, at 9:42 AM, Joe Skora <[email protected]> wrote:

I'm bumping this hoping for some feedback before I dive back into the ticket. Lacking any response for 30 days, I figure this either got overlooked due to year-end or no one has an opinion to add to the discussion (which seems unlikely). ;-)

On Tue, Dec 27, 2016 at 2:50 PM, Joe Skora <[email protected]> wrote:

All,

Before the change to the schema-based repositories was committed, I was doing some testing for NIFI-1847 Improve Provenance Space Utilization <https://issues.apache.org/jira/browse/NIFI-1847> based on these assumptions.

- A partition {{nifi.provenance.repository.directory.XYZ}} entry would only be individually tracked if there was a corresponding {{nifi.provenance.repository.directorySize.XYZ}} entry; otherwise it would only be counted against the aggregate totals.
- The original {{nifi.provenance.repository.max.storage.size}} property would represent an aggregate across all partitions, whether specifically tracked or not.
- Tracked partitions would be evaluated first and their sizes accumulated to avoid double work.

My testing showed improved use of space by partition, but also showed two problems.

- Calling the OS for the size of every journal, partition, and index file is expensive, so I'm looking at going to the OS every Nth pass and tracking delta writes in between.
- Writers are chosen round robin, which is far from optimal when the size and available space vary by partition. I have some thoughts but haven't put anything in code yet.

Considering that provenance recording seems to be a bottleneck on some flows, this needs to be as fast as possible while staying 100% reliable.

So, any thoughts on these issues or wisdom relating to repositories and provenance are appreciated.

Thanks,
Joe
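
For illustration only, here is a rough Java sketch of the "go to the OS every Nth pass and track delta writes in between" idea from the first problem above. Everything in it is an assumption: the class name PartitionSizeEstimator, its methods, and the refresh interval are hypothetical and are not NiFi APIs, and the expensive size measurement is passed in as a plain supplier rather than tied to any real repository code.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of NiFi): caches a partition's size and only
 * re-measures it on every Nth check, adding recorded write sizes in between.
 */
class PartitionSizeEstimator {
    private final LongSupplier measuredSize; // expensive call, e.g. summing file sizes via the OS
    private final int refreshInterval;       // N: re-measure on every Nth check
    private long lastMeasured;               // result of the most recent real measurement
    private long deltaSinceMeasure;          // bytes written since that measurement
    private int checksSinceMeasure;          // checks answered from the cache so far

    PartitionSizeEstimator(LongSupplier measuredSize, int refreshInterval) {
        if (refreshInterval < 1) {
            throw new IllegalArgumentException("refreshInterval must be >= 1");
        }
        this.measuredSize = measuredSize;
        this.refreshInterval = refreshInterval;
        this.lastMeasured = measuredSize.getAsLong(); // one real measurement up front
    }

    /** Called by a writer after it appends the given number of bytes. */
    synchronized void recordWrite(long bytes) {
        deltaSinceMeasure += bytes;
    }

    /** Returns the estimated size, asking the OS only on every Nth call. */
    synchronized long currentSize() {
        checksSinceMeasure++;
        if (checksSinceMeasure >= refreshInterval) {
            // Periodic correction: picks up growth or deletions that were never
            // reported through recordWrite (index files, expired journals, ...).
            lastMeasured = measuredSize.getAsLong();
            deltaSinceMeasure = 0;
            checksSinceMeasure = 0;
        }
        return lastMeasured + deltaSinceMeasure;
    }
}
```

The estimate can drift between real measurements, so a larger interval trades accuracy for fewer OS calls; whether that trade is acceptable for enforcing a storage limit would need the kind of testing described above.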
