RE: [RFC] Inheritable Properties Branch -- caching props in WC DB

Bert Huijben Thu, 13 Sep 2012 14:10:53 -0700


> -----Original Message-----
> From: Julian Foad [mailto:[email protected]]
> Sent: donderdag 13 september 2012 22:18
> To: Paul Burba
> Cc: Subversion Development
> Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB
> 
> Paul Burba wrote:
> 
> > [1] https://svn.apache.org/repos/asf/subversion/branches/inheritable-
> props
> 
> This branch caches properties of certain repository nodes, in a new place
in
> the WC DB.
> 
> I haven't examined it at all yet.  The Wiki page says: "The cache will be
stored
> in a new BLOB column in the NODES table called 'inherited_props'.  If the
> node is the base node of a WC root then this column is populated with a
> serialized skel of the node's inherited properties.  In all other cases it
is
> NULL."
> 
> Another new special column in the NODES table... the twenty-third
> column.  Do we have to?  Is it not complex enough already?
> 
> I want to make a radical suggestion.
> 
> What would it take to be able to share some data instead of creating a
> special-purpose column just for this cache?  Well, it would take more work
> initially, but hear me out, as I have a hunch it may make less work in the
end.
> 
> I think we would do well to factor out the storage of all data about a
> repository node-rev, moving that into a separate table that is
*referenced*
> from the NODES table.  That new table would be indexed by node-rev (that
> is: repos_id, rev,
> relpath) and have a column for the properties and perhaps other columns
> for other attributes of the node-rev.


That would be two additional table lookups per retrieval inside Sqlite3,
while an empty column costs negligible storage. A primary key that is not an
unique integer requires an additional lookup outside the extra lookup for
the foreign key handling.

Have you investigated the performance costs of this proposal?

I heard similar suggestions about properties before, but additional table
lookups are certainly not free in Sqlite. They are not expensive by itself,
but every extra db transaction is certainly a performance hit.

> 
> What we need to know here (for inherited props) is the properties set on
> each ancestor of ever WC "root node" (which is defined elsewhere and
> includes switched nodes etc.)  That's not a complex requirement.  Instead
of
> creating a new column dedicated to storing the union of all the props
> inherited from every ancestor node, how about we fetch and concatenate,
> on request, the node-rev table rows for all the pathwise ancestors of the
> desired node.  The cache maintenance then becomes ensuring that a set of
> node-rev table rows are populated (populated in the standard way for such
> rows), and that those rows are removed when no longer referenced.
> 
> But this particular use is not the only reason for factoring out this data
into a
> separate table.  There is:
> 
>   - For inherited props, this use.
> 
>   - For inherited props, if the current model of BASE node inheritance
> remains, we're also going to need to cache the inherited props for every
> mixed-rev node (that is, every node whose rev != its parent-in-BASE
> rev).  (I'll write separately about why we need this for off-line
operation.)

I think this case was explicitly ignored.

This would require that operations like commit start updating parts of the
working copy with information that is not available locally. Normally the
commit shouldn't be able to introduce invalid inherited state (or the commit
would fail). 

And an explicit update of a subtree already adds the necessary information
to the root (if the current state of the code still matches what I read
before)

>   - We already have (I think) the case where a local WC-to-WC copy or move
> of a tree results in poulating another set of tree nodes in the NODES
table
> with an identical copy of all the data.  Whenever we can reduce the amount
> of such copying, if the additional complexity is not too great, that can
make it
> easier to achieve both correctness and efficiency.

When we copy from BASE to another layer we don't have to copy the inherited
properties. The inheritance in the new location will be based on the new
url, not the original location. So the current queries already handle this
case by not copying the data from the BASE layer.

>   - For conflicts, it's useful to store the "theirs" and "old base"
> versions of the conflicted node.  Its kind, its properties, the reference
to its
> (pristine) text, and perhaps other attributes.

On trunk everything is prepared to store conflicts in a skel, what has this
to do with the inherited property storage?

In format 30 we already have the urls and properties of the conflicted nodes
(not the pristine text yet, but with the url we can retrieve that or we can
implement storing the sha1 references in the separate columns as initially
planned)

> 
>   - I think it would make the WC *much* more maintainable.  At the moment,
> there doesn't appear to be any codified connection between those eight
> columns in NODES.  There doesn't even seem to be any comment in the doc
> strings to indicate that.  It's almost impossible to analyze the huge
query
> statements to check that they're all updated together.
> 
> Therefore I propose a WC development that adds a new table (call it
> NODEREV?), indexed by repos/rev/relpath, with eight further columns that
> will be moved out of the current NODES table:
> 
>   repos_id, revision, repos_path -- the index columns
>   kind
>   properties
>   checksum
>   symlink_target
>   changed_revision
>   changed_date
>   changed_author
>   translated_size

(translated_size or really recorded_size shouldn't be in this list as it can
have a different value for the same file if it is checked out multiple
times)

> Then we add an indirection from the NODES table via the index columns
> whenever we access those columns.  And some means of controlling
> creation and deletion of those rows, perhaps using a ref-count.

Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries and
even more for cascading updates to provider better maintainability, while a
typical checkout working copy only contains nodes once?

I did a lot of performance work to get at the overall 15% speedup we have
since 1.7. Do you have any ideas on how much of that we will lose after this
suggested refactoring?


I'm no fan of adding yet another layer in the working copy to just
theoretically improve maintainability. That layer will also require
maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
we are slowly moving to a decentralized DVCS and maybe the storage should
then also move to such a model.

> I acknowledge that such a development is far from trivial, and I don't
expect
> Paul to do it as part of this work.  I just want to get the proposal out
there
> and see if there is any consensus for working toward something like it.
> 
> In terms of the inheritable props branch, I wonder if there's any way we
can
> move a bit closer to this proposal without going the whole hog, but I
can't
> think of anything.

I don't think this proposal, the inherited properties and/or the
conflicts-skels proposals are directly related.


And before we look at this or similar refactorings I think we should measure
the impact on some 'simple' case like 'svn status', that would have at least
an additional lookup for every layer of every node for the foreign key
handling.

        Bert


> - Julian

RE: [RFC] Inheritable Properties Branch -- caching props in WC DB

Reply via email to