Re: server-side log cache

Stefan Fuhrmann Mon, 19 Sep 2011 01:43:36 -0700

On 29.08.2011 18:35, Stefan Sperling wrote:

Sorry for the late response. I have been knocked
out for a while.

On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:

See 
http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
for what this is all about.

But the assumptions in that file are actually not valid.

Which ones are invalid? Can you explain in detail?

It makes at least 5 implicit assumptions:


* Noderevs are the relevant level on which user-level
  questions can be answered.
-> That might not be the case. Noderevs are a purely
   internal construct. In particular, a noderev might get
   copied but the actual delta is empty. Thinking about
   svn obliterate & friends that might actually happen
   as a result of some node removal / replacement.

* User-level queries (where got node copied to?) and
  internal queries (what are the copy dependencies?)
  are looking for the same information.
-> This is linked to the previous. Noderev copy-to info
   can be useful to determine the dependency graph.
   The point here is that we are mixing queries on two
   different layers here. If we need both, they should
   both exist as separate APIs.

* Noderevs changes are relevant for all svn obliterate.
-> That may not always be true. If you only want to
   remove "unused" history, the references in the skip-
   delta tree express the relevant dependencies.

* Noderev changes are the sufficient to answer the
  "where has this been copied to?" question.
-> They are quite unhelpful here. Noderevs might
   help find you changes on the node itself (see first
   assumption). But you need to evaluate the user-
   level copy operations of all parent folders for all
   currently known copies. There are far less of these
   copies (e.g. branch creation) than noderev changes
   (typ. multiple per revision).

* A noderev-based cache alone would reduce the
  amount of data to inspect / process for these queries.
-> The noderev cache is *larger* than e.g. a log cache
   due to each change also modifying all parent nodes.
   To be efficient, you need a node-level copy-to cache
   as well (see previous item).

* "Where did path P at rev N to M directly or indirectly
  copied to, e.g. which releases contain a certain faulty
  code segment; optionally list changes to targets?"
->  needs to scan parent paths for copies, too
   (reversed "log", revision graph)

Yes, the successor-id cache only gives us operation roots.
Information for child nodes needs to be derived -- it is not
within the scope of the cache itself.

That is unnecessarily complicated / expensive.
Using the proposed log cache, the relevant
information can be found in a *single*, highly
efficient scan.

It turns out that we can produce even humongous
reverse logs (50+ k nodes, thousands of branches
and tags) within a second by simply performing
a full history scan.

A example of how the whole process can be
implemented efficiently, can be found here:

https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp

I'll take a look at that, thanks!

The gist of it is that you keep all currently known
copy targets (e.g. branches, tags) in a tree. For
each revision, you walk that tree to find the affected
sub-tree. The LRU path gets optimized to speed up
the next look-up.

A usually fixed repo layout plus typical change
patterns result in a quasi-constant processing
time for each revision - independent from the
number of copies already found.

Things become much simpler and faster if you
eliminate string operations and use path IDs
instead. TSVN does an "svn log" at over 10M
revs per second.

Storage of successor IDs is essentially a cache of the result of the
inversion of the predecessor relation between all node-revisions.
This inversion is a *very* expensive operation (follow every node
that ever existed in the repository from its most recent revision
back to the revision it was created in).

Not true. At least the "*very* expensive" part.
Log -v takes about 1min for AFS over ra_local.
Building any of the index data described below
can be done in<  10s.

Any of it (if so, which parts?), or all of it?

All of it, of course. The expensive part is "svn log -v"
which may be expensive on slow disks. Everything
else is just writing a few MB of data.

I propose a modified version of TSVN's log cache
data structure. The adaptations:

* split into multiple files to reduce modification overhead
* remove rev-prop-dependent information (log comment,
   author, timestamp)
* add the reverse copy info
* simplify the data structures

This looks very interesting.

What about FSFS-specific requirements?

See assumptions above, this may require a different
data structure. But I think that noderev dependencies
will turn out to be redundant if you have a log cache
and access to the skip-delta forwards dependencies.

It sounds like you avoid those by storing data in semantics of the repos
layer (path@revision) instead of the FS layer (node-revision-id)?

Yes.

In this case separate implementations for FSFS and BDB aren't needed.
This could be an advantage (e.g. third party FS implementations
wouldn't need to change to support this).

It also eliminates on of the performance weaknesses
of SVN today: A log on some old / seldom changed
path can take a very long time.


I'll think about this some more, thanks.

Welcome ;)

-- Stefan^2.

Re: server-side log cache

Reply via email to