Re: Thoughts on NIFI-1847 Improve Provenance Space Utilization

Mark Payne Fri, 27 Jan 2017 06:59:13 -0800

Hey Joe,

Sorry - I don't think I saw this. I have actually been working on
NIFI-3356 [1], for which I hope to have a PR up in the next few days.
I've been doing some long-running tests, and I did find an issue
yesterday, so I've redeployed to some nodes to let it run over the
weekend. If all looks good I can perhaps have a PR in on Monday.


The Persistent Provenance Repository is quite old. At the time that it
was written, the requirements were simply to store data in a sequential
fashion and make it available for a Reporting Task to iterate over the
events sequentially. There was no compression, and there was no
indexing/searching. The requirements clearly have changed over the
years :) So I started working on a totally new implementation, and my
testing shows that it is 2-3 times faster than the Persistent Provenance
Repository while at the same time providing faster query capabilities
and immediate access to events (as opposed to after a 30-second rollover
period).

When I get a chance to get it posted, it would be great if you want to
put it through the wringer as well. I say all of this because, if you
are interested, it may be worth holding off a few days and looking into
implementing something similar to the new repo instead of focusing on
the PersistentProvenanceRepository (or updating both).

Thanks
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-3356



On Jan 27, 2017, at 9:42 AM, Joe Skora wrote:

I'm bumping this hoping for some feedback before I dive back into the
ticket.

Lacking any response for 30 days, I figure this either got overlooked due
to year-end or no one has an opinion to add to the discussion (which seems
unlikely).  ;-)



On Tue, Dec 27, 2016 at 2:50 PM, Joe Skora wrote:

All,

Before the change to the schema-based repositories was committed, I was
doing some testing for NIFI-1847 Improve Provenance Space Utilization
<https://issues.apache.org/jira/browse/NIFI-1847> based on these
assumptions.

  - A partition {{nifi.provenance.repository.directory.XYZ}} entry would
  only be individually tracked if there was a corresponding
  {{nifi.provenance.repository.directorySize.XYZ}} entry; otherwise it
  would only be counted against the aggregate totals.
  - The original {{nifi.provenance.repository.max.storage.size}}
  property would represent an aggregate across all partitions, whether
  specifically tracked or not.
  - Tracked partitions will be evaluated first and their sizes
  accumulated to avoid double work.
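
A minimal sketch of the accounting these three assumptions imply, in Java. The class and method names (ProvenanceSpaceAccounting, bytesOverLimit, the "<aggregate>" key) are hypothetical and not taken from the NiFi code base; partition sizes are passed in as plain byte counts rather than read from disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of NiFi: applies the three assumptions above.
final class ProvenanceSpaceAccounting {
    static final String AGGREGATE = "<aggregate>";

    // Stands in for nifi.provenance.repository.max.storage.size (bytes).
    private final long maxStorageBytes;
    // Stands in for the directorySize.XYZ entries, keyed by the XYZ suffix (bytes).
    private final Map<String, Long> partitionLimits;

    ProvenanceSpaceAccounting(long maxStorageBytes, Map<String, Long> partitionLimits) {
        this.maxStorageBytes = maxStorageBytes;
        this.partitionLimits = new LinkedHashMap<>(partitionLimits);
    }

    /**
     * Returns how many bytes each limit is exceeded by. Keys are partition
     * names for tracked partitions and AGGREGATE for the overall limit.
     * Limits that are not exceeded are left out of the result.
     */
    Map<String, Long> bytesOverLimit(Map<String, Long> usedBytesByPartition) {
        Map<String, Long> over = new LinkedHashMap<>();
        long total = 0L;

        // Tracked partitions first: check the individual limit and
        // accumulate the size so it is not computed a second time.
        for (Map.Entry<String, Long> limit : partitionLimits.entrySet()) {
            long used = usedBytesByPartition.getOrDefault(limit.getKey(), 0L);
            total += used;
            if (used > limit.getValue()) {
                over.put(limit.getKey(), used - limit.getValue());
            }
        }

        // Untracked partitions only count toward the aggregate.
        for (Map.Entry<String, Long> usage : usedBytesByPartition.entrySet()) {
            if (!partitionLimits.containsKey(usage.getKey())) {
                total += usage.getValue();
            }
        }

        if (total > maxStorageBytes) {
            over.put(AGGREGATE, total - maxStorageBytes);
        }
        return over;
    }
}
```

With a 250-byte aggregate limit, a 100-byte limit on a partition named "fast", and usage of 120 bytes on "fast" plus 150 bytes on an untracked "bulk" partition, this reports "fast" as 20 bytes over and the aggregate as 20 bytes over, while "bulk" is never reported on its own.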


My testing showed improved use of space by partition, but also showed two
problems.

  - Calling the OS for the size of every journal, partition, and index
  file is expensive so I'm looking at going to the OS every Nth pass and
  tracking delta writes in between.
  - Writers are chosen round-robin, which is far from optimal when the
  size and available space vary by partition.  I have some thoughts but
  haven't put anything in code yet.
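
A rough sketch of the first idea (going to the OS only every Nth pass and tracking delta writes in between), in Java. Everything here is an assumption for illustration: the class name CachedPartitionSize, the LongSupplier standing in for the expensive OS size lookup, and the recordWrite hook that a writer would call. It does not address writer selection.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of NiFi: caches a partition's size and
// only asks the OS for the real value every Nth pass.
final class CachedPartitionSize {
    private final LongSupplier osSizeLookup;   // expensive call, e.g. summing file sizes
    private final int refreshEveryNPasses;
    private long estimatedBytes;
    private int passesSinceRefresh;

    CachedPartitionSize(LongSupplier osSizeLookup, int refreshEveryNPasses) {
        if (refreshEveryNPasses < 1) {
            throw new IllegalArgumentException("refreshEveryNPasses must be >= 1");
        }
        this.osSizeLookup = osSizeLookup;
        this.refreshEveryNPasses = refreshEveryNPasses;
        this.estimatedBytes = osSizeLookup.getAsLong(); // start from a real measurement
    }

    /** Called by a writer after it appends bytes; keeps the estimate current. */
    synchronized void recordWrite(long bytesWritten) {
        estimatedBytes += bytesWritten;
    }

    /**
     * Called once per size-check pass. Returns the tracked estimate, except
     * on every Nth pass, when it is replaced by a fresh value from the OS.
     */
    synchronized long sizeForThisPass() {
        passesSinceRefresh++;
        if (passesSinceRefresh >= refreshEveryNPasses) {
            estimatedBytes = osSizeLookup.getAsLong();
            passesSinceRefresh = 0;
        }
        return estimatedBytes;
    }
}
```

The periodic refresh matters because the estimate drifts: deletes, rollover, and compression change the on-disk size without going through recordWrite, so the choice of N trades lookup cost against how stale the estimate can get.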


Considering that provenance recording seems to be a bottleneck on some
flows, this needs to be as fast as possible while staying 100%
reliable.  So, any thoughts on these issues or wisdom relating to
repositories and provenance would be appreciated.

Thanks,
Joe
