> -----Original Message----- > From: Julian Foad [mailto:julianf...@btopenworld.com] > Sent: donderdag 13 september 2012 22:18 > To: Paul Burba > Cc: Subversion Development > Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB > > Paul Burba wrote: > > > [1] https://svn.apache.org/repos/asf/subversion/branches/inheritable- > props > > This branch caches properties of certain repository nodes, in a new place in > the WC DB. > > I haven't examined it at all yet. The Wiki page says: "The cache will be stored > in a new BLOB column in the NODES table called 'inherited_props'. If the > node is the base node of a WC root then this column is populated with a > serialized skel of the node's inherited properties. In all other cases it is > NULL." > > Another new special column in the NODES table... the twenty-third > column. Do we have to? Is it not complex enough already? > > I want to make a radical suggestion. > > What would it take to be able to share some data instead of creating a > special-purpose column just for this cache? Well, it would take more work > initially, but hear me out, as I have a hunch it may make less work in the end. > > I think we would do well to factor out the storage of all data about a > repository node-rev, moving that into a separate table that is *referenced* > from the NODES table. That new table would be indexed by node-rev (that > is: repos_id, rev, > relpath) and have a column for the properties and perhaps other columns > for other attributes of the node-rev.
That would be two additional table lookups per retrieval inside Sqlite3, while an empty column costs negligible storage. A primary key that is not an unique integer requires an additional lookup outside the extra lookup for the foreign key handling. Have you investigated the performance costs of this proposal? I heard similar suggestions about properties before, but additional table lookups are certainly not free in Sqlite. They are not expensive by itself, but every extra db transaction is certainly a performance hit. > > What we need to know here (for inherited props) is the properties set on > each ancestor of ever WC "root node" (which is defined elsewhere and > includes switched nodes etc.) That's not a complex requirement. Instead of > creating a new column dedicated to storing the union of all the props > inherited from every ancestor node, how about we fetch and concatenate, > on request, the node-rev table rows for all the pathwise ancestors of the > desired node. The cache maintenance then becomes ensuring that a set of > node-rev table rows are populated (populated in the standard way for such > rows), and that those rows are removed when no longer referenced. > > But this particular use is not the only reason for factoring out this data into a > separate table. There is: > > - For inherited props, this use. > > - For inherited props, if the current model of BASE node inheritance > remains, we're also going to need to cache the inherited props for every > mixed-rev node (that is, every node whose rev != its parent-in-BASE > rev). (I'll write separately about why we need this for off-line operation.) I think this case was explicitly ignored. This would require that operations like commit start updating parts of the working copy with information that is not available locally. Normally the commit shouldn't be able to introduce invalid inherited state (or the commit would fail). And an explicit update of a subtree already adds the necessary information to the root (if the current state of the code still matches what I read before) > - We already have (I think) the case where a local WC-to-WC copy or move > of a tree results in poulating another set of tree nodes in the NODES table > with an identical copy of all the data. Whenever we can reduce the amount > of such copying, if the additional complexity is not too great, that can make it > easier to achieve both correctness and efficiency. When we copy from BASE to another layer we don't have to copy the inherited properties. The inheritance in the new location will be based on the new url, not the original location. So the current queries already handle this case by not copying the data from the BASE layer. > - For conflicts, it's useful to store the "theirs" and "old base" > versions of the conflicted node. Its kind, its properties, the reference to its > (pristine) text, and perhaps other attributes. On trunk everything is prepared to store conflicts in a skel, what has this to do with the inherited property storage? In format 30 we already have the urls and properties of the conflicted nodes (not the pristine text yet, but with the url we can retrieve that or we can implement storing the sha1 references in the separate columns as initially planned) > > - I think it would make the WC *much* more maintainable. At the moment, > there doesn't appear to be any codified connection between those eight > columns in NODES. There doesn't even seem to be any comment in the doc > strings to indicate that. It's almost impossible to analyze the huge query > statements to check that they're all updated together. > > Therefore I propose a WC development that adds a new table (call it > NODEREV?), indexed by repos/rev/relpath, with eight further columns that > will be moved out of the current NODES table: > > repos_id, revision, repos_path -- the index columns > kind > properties > checksum > symlink_target > changed_revision > changed_date > changed_author > translated_size (translated_size or really recorded_size shouldn't be in this list as it can have a different value for the same file if it is checked out multiple times) > Then we add an indirection from the NODES table via the index columns > whenever we access those columns. And some means of controlling > creation and deletion of those rows, perhaps using a ref-count. Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries and even more for cascading updates to provider better maintainability, while a typical checkout working copy only contains nodes once? I did a lot of performance work to get at the overall 15% speedup we have since 1.7. Do you have any ideas on how much of that we will lose after this suggested refactoring? I'm no fan of adding yet another layer in the working copy to just theoretically improve maintainability. That layer will also require maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases we are slowly moving to a decentralized DVCS and maybe the storage should then also move to such a model. > I acknowledge that such a development is far from trivial, and I don't expect > Paul to do it as part of this work. I just want to get the proposal out there > and see if there is any consensus for working toward something like it. > > In terms of the inheritable props branch, I wonder if there's any way we can > move a bit closer to this proposal without going the whole hog, but I can't > think of anything. I don't think this proposal, the inherited properties and/or the conflicts-skels proposals are directly related. And before we look at this or similar refactorings I think we should measure the impact on some 'simple' case like 'svn status', that would have at least an additional lookup for every layer of every node for the foreign key handling. Bert > - Julian