NODE_DATA (was: Chat with Erik Huelsmann)

Greg Stein Mon, 05 Jul 2010 13:58:54 -0700

Erik and I chatted on IM about the NODE_DATA table (aka "4th tree"). Figured
it would be a good thing to capture that here to the dev@ list. Below is our
chat, with only a few (personal) redactions.

There is some more conversation, which I'll forward separately...

Cheers,
-g

---------- Forwarded message ----------
From: Erik Huelsmann
Date: Mon, Jul 5, 2010 at 16:38
Subject: Chat with Erik Huelsmann
To: [email protected]

 [...]
 Erik: From the conversation the past weeks, I see there are 3 big items
remaining for 1.7; one of which is the "4th tree"
15:45 me: yup
 Erik: I was pondering the subject, but thought I'd tell you where my sudden
interest comes from before jumping into the porcelain
 me: hehe
  search your archives for NODE_DATA
  that'll provide most of the basics of the thinking
15:46 my last note said something about including copyfrom data in the
NODE_DATA table, but I'm nuking that idea
 Erik: I read up on it this afternoon, or at least quite a bit of it.
  ok.
  I think there are 2 ways of viewing NODE_DATA:
  1. as a stash for 'non-current' layers
15:47 2. as a table which holds all layers including the current one
  from what I read about it, your thinking is (2)?
 me: yes
  the data moves from both BASE_NODE and WORKING_NODE into that table,
15:48 and with particular queries, you get the "latest" node of the tree
  latest/most-current/topmost-layer/whatever
 Erik: k. if we make mistakes clearing out the table, that's probably the
best way to notice early :-)
15:49 me: :-)
15:50 Erik: About using it for BASE_NODE as well as WORKING: there's no
intention to share records between copied parts of the tree though, right?
  I mean, it'll all still be keyed on the local_relpath
 me: correct
 Erik: ok. because if not, I was expecting issues with e.g. presence
 me: yah. not even gonna try that. rows will be copied when a copy/move
occurs.
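
The "rows will be copied when a copy/move occurs" point can be sketched roughly as below. This is only an illustration, not the wc_db code: the table is cut down to a single kind column, the key is the <wc_id, local_relpath, op_depth> triple mentioned near the end of this chat, and copy_rows(), its parameters, and the choice of op_depth for the new rows are all hypothetical.

```python
import sqlite3


def copy_rows(conn, wc_id, src, dst, op_depth):
    """Copy the topmost row of src, and of each node below it, to dst.

    The new rows are written at the given op_depth. How that depth is
    chosen is not stated in the chat; the caller decides here.
    """
    prefix = src + "/"
    rows = conn.execute(
        "SELECT local_relpath, kind FROM node_data "
        "WHERE wc_id = ? AND (local_relpath = ? "
        "     OR substr(local_relpath, 1, ?) = ?) "
        "ORDER BY op_depth",
        (wc_id, src, len(prefix), prefix),
    ).fetchall()
    # Rows arrive lowest layer first, so the last kind seen for a path
    # is the one from its topmost layer.
    topmost = {}
    for relpath, kind in rows:
        topmost[relpath] = kind
    for relpath, kind in topmost.items():
        new_relpath = dst + relpath[len(src):]
        conn.execute(
            "INSERT INTO node_data (wc_id, local_relpath, op_depth, kind) "
            "VALUES (?, ?, ?, ?)",
            (wc_id, new_relpath, op_depth, kind),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node_data ("
    " wc_id INTEGER NOT NULL, local_relpath TEXT NOT NULL,"
    " op_depth INTEGER NOT NULL, kind TEXT NOT NULL,"
    " PRIMARY KEY (wc_id, local_relpath, op_depth))"
)
conn.executemany(
    "INSERT INTO node_data VALUES (?, ?, ?, ?)",
    [(1, "A", 0, "dir"), (1, "A/f", 0, "file"), (1, "AB", 0, "dir")],
)
copy_rows(conn, 1, "A", "newA", 1)
copied = conn.execute(
    "SELECT local_relpath, op_depth, kind FROM node_data "
    "WHERE op_depth = 1 ORDER BY local_relpath"
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM node_data").fetchone()[0]
```

Nothing is shared between A and newA afterwards: each path has its own rows, so a later change to one side cannot affect the other.
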
15:51 Erik: Ok. Mind me writing up some of the thoughts in a mail? It should
end up being a proposal for change of the current schema.
15:52 me: please go ahead! sure
 Erik: How far away do you expect yourself to be from moving to something
other than cleaning svn_wc__get_entry()?
15:53 me: I've got a few days of writing in-db props tests, then to bump
that format,
  then to work on NODE_DATA
15:54 one complication is entries upgrading,
  we have no more entry_modify() calls,
  but we still have to write old entries into the db,
  and that is done (today) using sql statements,
  which will need to switch over to updating NODE_DATA as appropriate
15:56 Erik: ok. and from Bert, I understood that's the same time when we get
feature parity with 1.6 (ie being able to replace parts of the tree)
 me: yah. we have a couple problems with adds-under-copy. a couple other
sequences.
15:57 Erik: philip was expecting problems from the "multi-copysource"
paradigm used to model mixed-rev WCs.
  especially because you can't tell they're part of the same op.
  do you see that differently?
15:58 me: I had thought to put the copyfrom_relpath/rev into the NODE_DATA
table to do mixed revs under one op,
  but am going back on that idea,
  and sticking to multiple ops in WORKING_NODE,
15:59 where each op specifies a different rev,
  and yes... that will cause problems to detect "single op",
  but that model is what we need for *commit* time,
 Erik: the question should probably be "do we need to".
  right.
 me: because when committing, we issue a new COPY for each operation in the
WORKING table,
16:00 and so... yah. "fine. it looks like different ops.",
  but do we care?
  it is entirely possible that a person ended up in that state with TRUE
multiple operations,
  or it is possible to reach that state from a single mixed-rev copy,
 Erik: when looking at it from a commit point of view, everything is part of
the same -yet uncreated - transaction, I guess.
 me: and I think it is important to NOT contain that kind of history,
  yes
  but the biggest user-visible feature,
  is "revert",
16:01 because you can only revert at operation roots,
  not children of those,
  so a mixed-rev copy will create multiple operation roots,
  which can then be independently reverted,
  but this can cause a problem because the ancestor node that has a
different revision,
 Erik: do we need elision for those? What if everything is updated to the
same rev?
 me: doesn't have the now-reverted descendant at that ancestor revision,
16:02 so when reverting in this situation, and there are no other layers of
NODE_DATA to provide the data,
  then you have to mark the node as excluded,
  so parent is r5, child is r7, and you revert the child,
  it now becomes an r5/excluded child,
  but even then... that might not be quite right because the child was
created in r6,
16:03 so maybe it just becomes a not-present node...
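
One way to picture how a mixed-rev copy turns into several operation roots is the sketch below. It assumes a simple rule that the chat implies but does not spell out: the copy target is a root, and any node whose source revision differs from its parent's starts a new operation. The function name and the dict-of-revisions input are made up for the illustration.

```python
import posixpath


def operation_roots(copy_root, revisions):
    """Return the relpaths that would become operation roots.

    revisions maps each copied relpath to the revision it was copied
    from. A node starts a new operation when its revision differs from
    its parent's (an assumed rule, for illustration only).
    """
    roots = []
    for relpath in sorted(revisions):
        if relpath == copy_root:
            roots.append(relpath)
            continue
        parent = posixpath.dirname(relpath)
        if revisions[relpath] != revisions[parent]:
            roots.append(relpath)
    return roots


# Parent at r5, child subtree at r7, as in the example above.
mixed = {"newA": 5, "newA/B": 7, "newA/B/f": 7, "newA/C": 5}
roots = operation_roots("newA", mixed)
uniform_roots = operation_roots("newA", {"newA": 5, "newA/B": 5})
```

With the r5/r7 input, newA and newA/B both come out as roots, so newA/B can be reverted on its own; with a uniform revision only newA is a root.
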
 Erik: what happens to the excluded/not-present nodes during a commit? Do
they get copied with the parent, if in the repos? Or are they deleted, if
present in the repos?
 me: post-copy, an update will not unify the revisions. they are still
copies of distinct revisions
  copied with the parent
16:04 consider two operations: svn cp A newA ; svn cp B newA/B
  well... insert an 'svn rm newA/B' in between
16:05 if newA/B is reverted, then the commit has a copy of A including all
of its children
 Erik: right.
 me: now... if you reach that same state via a single mixed-rev copy, then
you revert the newA/B "copy", then a commit should contain all of newA
  (well, as a copy of a...@some-rev)
16:06 iow, history says whether you have child data after that child-revert,
  and you don't in a mixed-rev copy,
  so you have to leave something there. I think that is not-present
16:07 Erik: it's not excluded (because that assumes existence) or deleted
(same reason)
16:08 so, you need something which says "it might or might not be here in
the repos, but I don't have it"
 me: which is not-present. we report not-present to the server, and it will
send stuff if something should be there. or it will NOT send something, and
we remove the not-present node in that case.
 Erik: sounds like absent, although currently that may assume existence too.
 me: no. absent is a misnomer for not-authorized.
16:09 that's on a todo list for renaming.
16:10 Erik: ok. I get the context. Let's see if I can get something in a
mail about it.
 me: not-present basically means "this is a versioned node at
*some* revision. we don't have its details here right now."
16:11 it primarily appears when you commit the deletion of a file. the
parent's rev implies its existence, but post-commit-rev it does not exist.
the parent has to have some kind of marker about the child, and that is
not-present
16:12 Erik: hmm. does our editor have this notion?
16:13 me: the not-present concept is part of the Reporter, not the Editor
  we report not-present nodes to the server. when the server gives us stuff
to update our working copy, the editor will put a node there, or say nothing
about it.
16:15 (and as I said, post-update, we remove all not-present nodes; if the
server said nothing about them, then they are not part of the target
revision, so we can safely remove them)
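
The report/update cycle for not-present nodes described above could be modelled roughly like this. It is a toy model only: the working copy is a plain dict of relpath to presence, server_sent stands in for whatever the editor drive added, and finish_update() is a hypothetical name, not a Subversion function.

```python
def finish_update(nodes, server_sent):
    """Return the node map after an update has been driven.

    nodes maps relpath -> presence string. server_sent is the set of
    relpaths for which the editor put a node in place. Any not-present
    node the server said nothing about is dropped.
    """
    result = {}
    for relpath, presence in nodes.items():
        if presence != "not-present":
            result[relpath] = presence
        elif relpath in server_sent:
            # The server sent something for this path, so it exists
            # in the target revision.
            result[relpath] = "normal"
        # Otherwise: not part of the target revision, so no entry.
    return result


before = {"A": "normal", "A/gone": "not-present", "A/back": "not-present"}
after = finish_update(before, {"A/back"})
```
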
  now... part of the issue is that we're talking about copy/move children,
16:16 rather than nodes living in the BASE tree,
  so I'd advise creating a new presence for these, for clarity's sake.
  in fact, during a conversation at some point, I suggested expanding the
set of presence values. include copy-here/move-here/moved-away into the
present set,
  I erroneously thought it Good to keep a minimal set,
16:17 but that isn't really true. and it means we need to do a
scan_addition/scan_deletion to get data,
  when we may only need the status that is easily derived from the presence
value
  (obviously, you always have to scan for an operation root; tho... with
NODE_DATA and the op_depth, the scan is more of a *skip* :-) )
16:18 Erik: sounds sane.
  we might want to remove the scans by expanding the presence set if
NODE_DATA is going quickly enough.
16:19 [...]
 me: yeah
 Erik: but it would be nice to carve out NODE_DATA before then.
  [...]
 me: anyways. yeah. in the course of adding NODE_DATA, then we can also
expand the set of presence values to assist with various types of data
lookups
16:26 (scanning will still be necessary, but we may be able to improve the
algorithm)
16:28 Erik: to recap: NODE_DATA contains the information to link a BASE_NODE
or WORKING_NODE to its repository location.
 me: no
16:29 only BASE nodes have repository locations.
  the table has about eight columns. I listed those out in one of the
emails.
16:31 kind, [checksum], changed_*, properties, [symlink_target]
  7 columns of data. and then the key is <wc_id, local_relpath, op_depth>
16:32 we may also want to put translated_size and last_mod_time into
NODE_DATA
  s/may/probably/
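
Put together, the columns and key listed above might look like the following sketch, together with the kind of query that returns the "latest" layer of a node. The column types, the expansion of changed_* into rev/date/author, and the reading of op_depth 0 as the BASE layer are assumptions made for the example, not something settled in this chat; translated_size and last_mod_time are left out because they are only described as probable.

```python
import sqlite3

SCHEMA = """
CREATE TABLE node_data (
    wc_id          INTEGER NOT NULL,
    local_relpath  TEXT NOT NULL,
    op_depth       INTEGER NOT NULL,
    kind           TEXT NOT NULL,
    checksum       TEXT,      -- bracketed in the chat, so nullable here
    changed_rev    INTEGER,   -- changed_* expanded as an assumption
    changed_date   INTEGER,
    changed_author TEXT,
    properties     BLOB,
    symlink_target TEXT,      -- bracketed in the chat, so nullable here
    PRIMARY KEY (wc_id, local_relpath, op_depth)
)
"""


def topmost(conn, wc_id, relpath):
    """Return (op_depth, kind) of the highest layer, or None if absent."""
    return conn.execute(
        "SELECT op_depth, kind FROM node_data "
        "WHERE wc_id = ? AND local_relpath = ? "
        "ORDER BY op_depth DESC LIMIT 1",
        (wc_id, relpath),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO node_data (wc_id, local_relpath, op_depth, kind) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "A/f", 0, "file"),     # assumed BASE layer
        (1, "A/f", 2, "symlink"),  # a later working layer replaces it
        (1, "A/g", 0, "file"),
    ],
)
latest_f = topmost(conn, 1, "A/f")
latest_g = topmost(conn, 1, "A/g")
missing = topmost(conn, 1, "A/nope")
```

Because op_depth is part of the key, every layer of a path can coexist in the one table, and picking the current node is just a matter of taking the highest depth.
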
