I'm bumping this hoping for some feedback before I dive back into the
ticket.

Lacking any response for 30 days, I figure this either got overlooked due
to year-end or no one has an opinion to add to the discussion (which seems
unlikely).  ;-)



On Tue, Dec 27, 2016 at 2:50 PM, Joe Skora <[email protected]> wrote:

> All,
>
> Before the change to the schema-based repositories was committed, I was
> doing some testing for NIFI-1847 Improve Provenance Space Utilization
> <https://issues.apache.org/jira/browse/NIFI-1847> based on these
> assumptions.
>
>    - A partition {{nifi.provenance.repository.directory.XYZ}} entry would
>    only be individually tracked if there was a corresponding
>    {{nifi.provenance.repository.directorySize.XYZ}} entry, otherwise it
>    would only be counted against the aggregate totals.
>    - The original {{nifi.provenance.repository.max.storage.size}}
>    property would represent an aggregate across all partitions, whether
>    specifically tracked or not.
>    - Tracked partitions would be evaluated first and their sizes
>    accumulated to avoid double work.
>
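
To make the three assumptions above concrete, here is a rough sketch of how
the evaluation could work. Everything in it is hypothetical: the class name
ProvenanceBudget, the "*" key for the aggregate, and the idea of returning
bytes-to-purge per partition are illustration only, not existing NiFi code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not an existing NiFi class.
class ProvenanceBudget {
    static final String AGGREGATE = "*"; // key for overage against the aggregate limit

    private final long aggregateMaxBytes; // stands in for max.storage.size
    // stands in for the directorySize.XYZ entries (tracked partitions only)
    private final Map<String, Long> trackedMaxBytes = new LinkedHashMap<>();

    ProvenanceBudget(long aggregateMaxBytes) {
        this.aggregateMaxBytes = aggregateMaxBytes;
    }

    void trackPartition(String name, long maxBytes) {
        trackedMaxBytes.put(name, maxBytes);
    }

    // Returns the bytes to purge per tracked partition, plus an AGGREGATE
    // entry if the total is still over the aggregate limit afterwards.
    Map<String, Long> overage(Map<String, Long> usedBytes) {
        Map<String, Long> result = new LinkedHashMap<>();
        long total = 0L;
        long plannedPurge = 0L;
        // Tracked partitions first, accumulating their sizes as we go
        // so they are not measured a second time for the aggregate.
        for (Map.Entry<String, Long> e : trackedMaxBytes.entrySet()) {
            long used = usedBytes.getOrDefault(e.getKey(), 0L);
            total += used;
            long over = Math.max(0L, used - e.getValue());
            if (over > 0L) {
                result.put(e.getKey(), over);
                plannedPurge += over;
            }
        }
        // Untracked partitions only count against the aggregate.
        for (Map.Entry<String, Long> e : usedBytes.entrySet()) {
            if (!trackedMaxBytes.containsKey(e.getKey())) {
                total += e.getValue();
            }
        }
        long aggregateOver = Math.max(0L, total - plannedPurge - aggregateMaxBytes);
        if (aggregateOver > 0L) {
            result.put(AGGREGATE, aggregateOver);
        }
        return result;
    }
}
```

Subtracting the per-partition purge before checking the aggregate is my own
choice here, so the same bytes are not scheduled for removal twice.
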
>
> My testing showed improved use of space by partition, but also showed two
> problems.
>
>    - Calling the OS for the size of every journal, partition, and index
>    file is expensive, so I'm looking at going to the OS every Nth pass and
>    tracking delta writes in between.
>    - Writers are chosen round robin, which is far from optimal when the
>    size and available space vary by partition.  I have some thoughts but
>    haven't put anything in code yet.
>
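
For discussion, a minimal sketch of both ideas: a size estimate that only
goes to the OS every Nth pass and adds recorded writes in between, and a
writer choice based on remaining space instead of round robin. The names
CachedDirectorySize and WriterChooser are made up for this email, the OS
lookup is passed in as a function so it can be swapped or stubbed, and
"pick the partition with the most free space" is just one possible policy.

```java
import java.io.File;
import java.util.function.ToLongFunction;

// Hypothetical sketch; not existing NiFi code.
class CachedDirectorySize {
    private final File dir;
    private final int refreshEvery;                  // go to the OS every Nth pass
    private final ToLongFunction<File> osSizeLookup; // the expensive call
    private long cachedBytes;
    private long pendingDelta;                       // bytes written since last refresh
    private int passesSinceRefresh;

    CachedDirectorySize(File dir, int refreshEvery, ToLongFunction<File> osSizeLookup) {
        this.dir = dir;
        this.refreshEvery = refreshEvery;
        this.osSizeLookup = osSizeLookup;
        this.cachedBytes = osSizeLookup.applyAsLong(dir); // initial measurement
    }

    // Called by a writer after it appends to a journal or index file.
    synchronized void recordWrite(long bytes) {
        pendingDelta += bytes;
    }

    // Called once per evaluation pass.
    synchronized long estimate() {
        passesSinceRefresh++;
        if (passesSinceRefresh >= refreshEvery) {
            cachedBytes = osSizeLookup.applyAsLong(dir); // resync with the OS
            pendingDelta = 0L;
            passesSinceRefresh = 0;
        }
        return cachedBytes + pendingDelta;
    }
}

// Hypothetical alternative to round robin: favor the emptiest partition.
class WriterChooser {
    // Returns the index of the partition with the most remaining space.
    static int mostFreeSpace(long[] capacityBytes, long[] usedBytes) {
        int best = 0;
        long bestFree = Long.MIN_VALUE;
        for (int i = 0; i < capacityBytes.length; i++) {
            long free = capacityBytes[i] - usedBytes[i];
            if (free > bestFree) {
                bestFree = free;
                best = i;
            }
        }
        return best;
    }
}
```

The estimate drifts between refreshes if anything other than the tracked
writers touches the directory (purges, rollovers), so N would need to be
small enough that the drift stays acceptable.
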
>
> Considering that provenance recording seems to be a bottleneck on some
> flows, this needs to be as fast as possible while staying 100%
> reliable.  So, any thoughts on these issues or wisdom relating to
> repositories and provenance would be appreciated.
>
> Thanks,
> Joe
>
