Stefan Sperling wrote on Mon, Aug 29, 2011 at 18:01:38 +0200:
> On Sat, Aug 27, 2011 at 02:15:31AM +0300, Daniel Shahaf wrote:
> > > - cache can be regenerated on demand
> >
> > Regenerated *offline*, right?  ie, if the cache is lost it can be
> > derived from the revision files (which remain authoritative), but
> > this process need not withstand concurrent readers.  (The problem
> > being the need to decrement the cached-rev file)
>
> Well, decrementing the cached-rev file would have to be done offline.
> But the cache can be rebuilt while the repository is active.
>
> We could have a flag that says whether the cache is active.
> E.g. after an upgrade from an older repository format, the cache
> would be flagged inactive, until it has been populated.
>
> When rebuilding the cache we'd take the repository offline,
> mark the cache inactive, remove the cached-rev file, go online,
> and rebuild the cache.  Readers will get information for older
> revisions only until the cache has been rebuilt.
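To make that sequence concrete, here is a rough C model of the rebuild
procedure quoted above (all names are invented for illustration; none
of this is actual FSFS code):

```c
#include <stdint.h>

/* Minimal model of the rebuild sequence: the cache is flagged
 * inactive while the repository is offline, the cached-rev watermark
 * is reset, and readers fall back to replay for revisions above the
 * watermark until population catches up. */
enum cache_state { CACHE_INACTIVE = 0, CACHE_ACTIVE = 1 };

struct succ_cache {
  enum cache_state state;
  int64_t cached_rev;   /* highest revision with cached successor data */
};

/* A reader may consult the cache only while it is active, and only
 * for revisions at or below the watermark. */
static int
cache_answers_for(const struct succ_cache *c, int64_t rev)
{
  return c->state == CACHE_ACTIVE && rev <= c->cached_rev;
}

/* Rebuild: mark inactive and reset the watermark offline, then
 * repopulate revision by revision once the repository is back
 * online, bumping the watermark as each revision is cached. */
static void
rebuild_cache(struct succ_cache *c, int64_t youngest_rev)
{
  c->state = CACHE_INACTIVE;   /* done while the repository is offline */
  c->cached_rev = -1;          /* "no revisions cached" */
  c->state = CACHE_ACTIVE;     /* repository goes back online */
  for (int64_t r = 0; r <= youngest_rev; r++)
    c->cached_rev = r;         /* populate r, then advance watermark */
}
```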
We mentioned a format number for the cache files; this "active flag"
almost smells like a format number for the entire cache... (with
format 0 being "inactive")

> > > The index size and offsets are 64 bit values.
> >
> > Stored as?  ASCII decimal?  (If so: how is it delimited or padded?)
> > Big endian binary?
>
> Some binary format.  Fixed size.

Fair enough.

> > Q: How do you update the 'offset of free space' in a manner that is
> > consistent to readers?
> >
> > ie, if the offset was ASCII ' 800' and the new offset is '1600', you
> > must avoid a reader accidentally seeing '1800'.
>
> The offsets appear at the start of a disk block.
> They should be a fixed-sized binary value -- ASCII has problems
> as you've pointed out.
>
> In any case, I am assuming that flushing data to disk flushes one
> disk block, i.e. that we can update one block atomically during
> flush.  Maybe this assumption isn't valid.

ack

> > (I seem to remember some ideas exchanged on the revprop-packing
> > design discussion --- including one that involved having two offsets
> > written down in the file, and one of them *always* being valid (but
> > not always the same one) --- perhaps Peter will remember...)
>
> Do you have a link to the archives?

Not exactly, sorry.  Most discussions happened in this (huge) thread:

From: kmra...@rockwellcollins.com
To: dev@subversion.apache.org
Subject: Does fsfs revprop packing no longer allow usage of traditional backup software?
Message-ID: <off5dc1071.793efd7c-on862578bf.005304da-862578bf.00552...@rockwellcollins.com>

The crux of the idea, though, is to have two byte locations for the
offset --- say, store one integer at bytes 4-7 and another at bytes
8-11 --- and have some logic to choose which one of those to read and
treat as the offset.  In the meantime the other integer can be updated
arbitrarily.
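A rough C sketch of that two-slot scheme, under the assumption that
each slot is paired with a generation counter and that each store can
be flushed independently (names and layout invented for illustration):

```c
#include <stdint.h>

/* Each slot holds a (generation, offset) pair.  A writer always
 * overwrites the OLDER slot, flushes the offset, and only then bumps
 * that slot's generation.  A reader picks the slot with the higher
 * generation, so it never observes a half-updated offset. */
struct offset_slot {
  uint64_t generation;
  uint64_t offset;      /* offset of free space in the cache file */
};

struct cache_header {
  struct offset_slot slot[2];
};

/* Reader: return the offset from the slot with the newer generation. */
static uint64_t
read_offset(const struct cache_header *hdr)
{
  int newest = hdr->slot[1].generation > hdr->slot[0].generation;
  return hdr->slot[newest].offset;
}

/* Writer: update the older slot, then publish it by giving it the
 * highest generation.  On disk, each step would be followed by a
 * flush; here we only model the ordering. */
static void
write_offset(struct cache_header *hdr, uint64_t new_offset)
{
  int newest = hdr->slot[1].generation > hdr->slot[0].generation;
  int oldest = !newest;
  hdr->slot[oldest].offset = new_offset;                       /* flush */
  hdr->slot[oldest].generation = hdr->slot[newest].generation + 1;
                                                               /* flush */
}
```

A crashed writer leaves at worst a stale-but-valid slot winning, which
is exactly the "one of them *always* being valid" property.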
> > So, invariant: given a node-rev, all its s-blocks, except perhaps
> > the last (in order of appearance in the file), have at most <size of
> > the longest possible node-rev-id, minus one> free (zero) bytes at
> > the end.
>
> Did you mean "at least"?

No.  I said that "All completed s-blocks have at most O(1) free bytes
at the end".

> > > It updates the 'next s-block' offset in the previous s-block to
> > > the offset of the new s-block, and flushes the cache file to disk.
> >
> > At what point does the writer update the cached-rev file?
>
> The writer doesn't update the cached-rev file.  This is done by
> the global cache sync code.
>
> I forgot to include two steps for the writer, though.
> It needs to update the offset to the last s-block in the index

You forgot to include "two steps", but you only list one step here.
Did you forget to list the other step which you had originally
forgotten to list?

> > So, a lost block is in the file.  You could avoid that by seeking,
> > not to the end of the file but to
> >
> >     SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index
> >
> > (and, yes, I realize that this is a crash situation.  two answers to
> > that: first, it's easy to avoid having an unused block; second, if
> > a file grows an unused block then a DoS can translate to running out
> > of disk space.)
>
> I don't really understand you here.
> Are you suggesting that the code should check for a dead s-block
> first, before appending to the file?

I'd like to avoid growing 'holes' in the middle of the file that
readers will never access.  (I don't want to require dump|load in
order to get rid of such holes.)

> That MAX doesn't make sense.  Each node_rev_id only appears once
> in the index.  Maybe that wasn't clear yet?  There is exactly one
> entry for each node-rev-id from revision N in the cache file for
> revision N.

Clear.
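What I mean by that SELECT, sketched in C (index layout and names are
made up for illustration; I'm assuming fixed-size s-blocks and that
each index entry records the offset of that node-rev's last s-block):

```c
#include <stdint.h>
#include <stddef.h>

#define SBLOCK_SIZE 4096   /* assumed fixed s-block size */

/* One index entry per node-rev-id of revision N. */
struct index_entry {
  uint64_t first_sblock;   /* offset of the node-rev's first s-block */
  uint64_t last_sblock;    /* offset of the node-rev's last s-block */
};

/* SELECT MAX(last_sblock) + SBLOCK_SIZE FROM index: derive the append
 * position from the index instead of from the physical end of file,
 * so an s-block that was appended but never linked into the index
 * (e.g. after a crash) gets overwritten rather than becoming a
 * permanent hole. */
static uint64_t
next_append_offset(const struct index_entry *index, size_t n,
                   uint64_t data_start)
{
  uint64_t max_last = 0;
  int seen = 0;
  for (size_t i = 0; i < n; i++) {
    if (index[i].last_sblock >= max_last) {
      max_last = index[i].last_sblock;
      seen = 1;
    }
  }
  return seen ? max_last + SBLOCK_SIZE : data_start;
}
```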
> > > for node_rev in $rev {
> > >   obtain predecessor p_node_rev of $node_rev
> > >   obtain revision p_rev from $p_node_rev
> >
> > Should the cache note which of the successors is the M-successor
> > (the one with the same copy-id)?  Perhaps by stipulating that the
> > M-successor is always the first in the list of successors, and
> > leaving room for it if a C-successor (a copy) is created before the
> > M-successor.
>
> We don't know how long the M-successor ID is going to be.

Actually we can bound it, but the bound I have in mind is going to be
a large overestimate for the common case.

> If we want to optimise locating the M-successor, we could add the
> offset to the s-block which contains the M-successor to the index,
> and use a special flag byte to terminate the M-successor string.

This is probably going to be more space-efficient, yes.

> > (It does mean we have two codepaths though, eg both packed and
> > non-packed shards; and <cue wild idea> perhaps it'll be easier to
> > have *only* packed shards, and to use the expensive replay method to
> > answer queries about revisions younger than the latest pack file?
> > (We may want to use a smaller shard size in this case.)
> >
> > And, by the way, this idea isn't /so/ unwieldy.  I think that you
> > can use *only* pack files --- and also use pack files for unfinished
> > shards --- if, when you run out of INDEX_SIZE, you rewrite the whole
> > file and move-into-place a new pack file (with a doubled INDEX_SIZE)
> > on top of the old pack file.  The advantage here would be not having
> > the both-packed-and-nonpacked-shards headache to deal with.)
>
> Might be possible.  Not sure if that's worth it.
>
> > >   write $rev into 'cached-rev' file
> > >   flush 'cached-rev' file
> >
> > No, you don't flush, you move-into-place.
>
> I decided to avoid move-into-place on purpose.
>
> This is my biggest question:
>
> Does FSFS presently move stuff into place if the move target location
> already exists?
> I am asking because I don't think this would work on Windows.
> We cannot overwrite a file if a reader has opened it.
> Or do we really use the delete-retry-loop on Windows servers?
>
> The only situation I can think of is where a previous commit created
> a revision file and didn't update 'current'.  In which case it's
> unlikely that the revision file will be open when the next commit
> moves a file with the same name into place.

As I said on IRC: look into revprop editing in 1.6 and 1.7 FSFS.
(And probably the locks/ hierarchy, too; see fsfs/lock.c.)

> > Okay, so the cache reading procedure MUST change, to account for
> > the non-packed cache file vanishing (by a concurrent packer) and the
> > pack file appearing instead.  This is another layer of headache on
> > top of the base one :-(
>
> Do we support packing while the repository is live?

Yes.

> I hadn't considered this.
>
> Thanks for your feedback!

Welcome.
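For reference, the "delete-retry-loop" pattern mentioned above has
roughly this shape.  (This is only an illustration of the idea using
plain rename(); the real code paths go through APR, and the EACCES
case is what you'd see on Windows when a reader still has the target
open.)

```c
#include <stdio.h>
#include <errno.h>

/* Try to move SRC over DST, retrying a bounded number of times when
 * the failure looks like a transient sharing violation (a reader
 * holding the target open).  Returns 0 on success, -1 on failure. */
static int
rename_with_retries(const char *src, const char *dst, int max_tries)
{
  for (int tries = 0; tries < max_tries; tries++) {
    if (rename(src, dst) == 0)
      return 0;
    if (errno != EACCES && errno != EEXIST)
      return -1;          /* non-retryable error, e.g. ENOENT */
    /* in real code: sleep briefly here so the reader can close */
  }
  return -1;              /* gave up; caller must cope */
}
```

Note the bounded retry count: an unbounded loop would hang the server
if some process keeps the file open indefinitely.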