Re: reftable [v4]: new ref storage format

2017-08-05 Thread Shawn Pearce
On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty  wrote:
> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce  wrote:
>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  
>> wrote:
>>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
 4th iteration of the reftable storage format.
 [...]
>>>
>>> Before we commit to Shawn's reftable proposal, I wanted to explore
>>> what a contrasting design that is not block based would look like.
>>
>> I forgot to look at a 1k chunk size, as you suggested that might also
>> be suitable. Here is the more complete experiment table:
>>
>>| size   | seek_cold | seek_hot  |
>> mh  1k | 29.4 M | 20.6 usec | 10.7 usec |   <== Row fixed
>> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
>> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
>> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

I modified reftable to use a multi-level index as recommended by
Michael Haggerty, and re-ran the above table:

fmt |  bs | idx  |  size  | seek_cold | seek_hot  |
|-|--||---|---|
 mh |  1k | 4 lv | 29.4 M | 20.1 usec |  7.1 usec |
 sp |  1k | 4 lv | 30.7 M | 21.0 usec |  5.5 usec |

 mh |  4k | 2 lv | 28.3 M | 23.4 usec | 11.2 usec |
 sp |  4k | 2 lv | 29.2 M | 19.9 usec |  5.4 usec |

 sp |  4k | 1 lv | 29.2 M | 62.9 usec |  5.6 usec |
 sp | 64k | 1 lv | 27.7 M | 35.6 usec | 21.6 usec |

fmt:  mh = Michael's proposal, sp = Shawn's reftable
bs:  chunk size or block size in bytes
idx:  how many levels of index
size:  total file size in bytes

I can't explain the slightly slower sp-1k-4lv vs. mh-1k-4lv seek_cold
in the first two rows. It might simply be the larger footer read
slowing down JGit. It's reliably flipped in the next two rows (at 4k).

reftable is more efficient in seek_hot at finding and parsing a
reference. For multi-level indexes both sp and mh implementations used
a lazy caching strategy that caches index blocks along the path to the
ref, but doesn't cache the final ref block. The tests amortized this
by performing 60,000 lookups on the same ref.
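
For illustration, the lookup path both implementations roughly follow
can be sketched like this (the types and names are invented for the
example; this is not the actual JGit code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a multi-level index lookup that lazily caches the index blocks
// along the path to a ref, but not the final ref block. All names invented.
class IndexedRefLookup {
  interface BlockSource { byte[] read(long addr, int len); }
  interface Block {
    boolean isIndex();
    long findChild(String refName);   // index block: address of the next level
    byte[] findRef(String refName);   // ref block: the ref record, or null
  }
  interface BlockParser { Block parse(byte[] data); }

  private final BlockSource src;
  private final BlockParser parser;
  private final int blockSize;
  private final long rootIndexAddr;
  private final Map<Long, Block> indexCache = new HashMap<>();

  IndexedRefLookup(BlockSource s, BlockParser p, int blockSize, long rootIndexAddr) {
    this.src = s; this.parser = p;
    this.blockSize = blockSize; this.rootIndexAddr = rootIndexAddr;
  }

  byte[] seek(String refName) {
    long addr = rootIndexAddr;
    for (;;) {
      Block b = indexCache.get(addr);
      if (b == null) {
        b = parser.parse(src.read(addr, blockSize));
        if (b.isIndex())
          indexCache.put(addr, b);    // cache index blocks on the path...
      }
      if (!b.isIndex())
        return b.findRef(refName);    // ...but not the final ref block
      addr = b.findChild(refName);
    }
  }
}
```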

>> At least in this case, the 1k block size seems like a good tradeoff.

With the multi-level index now in reftable, it seems like reftable at
4k, 2 level index is a better tradeoff. It aligns with the most common
filesystem block size, and has lower seek times, both cold and hot.
It's slightly larger than Michael's alternative proposal (+921 K), but
compresses better than reftable at 1k (-1.5 M).


Re: reftable [v4]: new ref storage format

2017-08-03 Thread Shawn Pearce
On Thu, Aug 3, 2017 at 3:48 PM, Michael Haggerty  wrote:
> I've revised the blockless reftable proposal to address some feedback:

I've been thinking more about your blockless proposal.

I experimentally modified my reftable implementation to omit padding
between blocks, bringing it a tiny bit closer to your blockless
proposal. Unfortunately this slightly increased latency for lookups on
a 4k chunk size. I speculate this is because chunks are no longer
aligned with the filesystem page, and I'm forcing Mac OS X to give me
two pages out of the filesystem. Using padding to align to a 4k block
is slightly faster, and the average wasted per block is <=20 bytes,
too small to fit another ref.

The restart table and binary search within the 4k block is a
performance win. Disabling the restart table significantly increased
lookup latency.
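
To illustrate why: binary search over the restart offsets, which point
at records carrying a full refname, narrows the target to one short
interval that is then scanned linearly. A rough sketch (record parsing
elided; names are invented, not the actual JGit code):

```java
// Sketch of a lookup inside one block using the restart table. Restart
// offsets point at records stored with a full (uncompressed) name; the
// record-parsing helpers are placeholders, not the actual JGit code.
class BlockSearch {
  interface RecordReader {
    String nameAt(int offset);                 // full refname at a restart point
    int scanForward(int offset, String want);  // linear scan of one interval; -1 if absent
  }

  static int find(int[] restartOffsets, RecordReader rec, String want) {
    int lo = 0, hi = restartOffsets.length - 1;
    int start = restartOffsets[0];
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = rec.nameAt(restartOffsets[mid]).compareTo(want);
      if (cmp == 0)
        return restartOffsets[mid];
      if (cmp < 0) { start = restartOffsets[mid]; lo = mid + 1; }
      else hi = mid - 1;
    }
    // Binary search has narrowed the target to a single restart interval;
    // only that short span needs to be parsed record by record.
    return rec.scanForward(start, want);
  }
}
```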

tl;dr:  I think the block alignment and restart table are wins vs. the
multi-level index.


A suggested downside of my reftable design is that the ref index at
4k block size for 866k refs is 199 KiB, and must be paged in for
binary search to locate the correct block for any lookup. The pack idx
for the two main packs in this repository is 210 MiB. We think fairly
little of mmap'ing 210 MiB to perform binary search to find object
data; 199 KiB for ref data seems to be a bargain. An advantage of the
single-level index is that only one page is touched after the index is
loaded.

Hot reftable reads (5.6 usec) are faster than loose ref reads (6.5
usec). Once the ref index is loaded, reftable can read a ref more
quickly than the time required to open-read-close a loose ref.
Admittedly, a large index slows down a cold read.

tl;dr:  I just don't think the size of the index is a concern.


I really favor keeping the reflog data in a section separate from the
ref values themselves. Even for smaller transaction files, it improves
scan and lookup time by letting readers who only care about the name
and SHA-1 value of a ref avoid paging in or skipping over log record
payloads. However, I also agree that the aggregates may benefit from
ref and log being separate files.


> * Add a way to mark a reflog entry "deleted" without having to rewrite
> everything. This is mostly meant to deal with `refs/stash`.

This is an interesting idea. Given how I implemented reftable in JGit,
just inserting a deletion record for the same (ref,update_index) tuple
would make it trivial to hide the prior entry.
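
A rough sketch of what that looks like during a merged read of the
stack (the record type and layout are only illustrative, not the
actual JGit record format):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: a deletion record written for the same (ref, update_index) key in a
// newer table shadows the older entry during a merged read. Types invented.
class LogMerge {
  static class LogRec {
    final String ref; final long updateIndex; final boolean deleted; final String msg;
    LogRec(String ref, long updateIndex, boolean deleted, String msg) {
      this.ref = ref; this.updateIndex = updateIndex; this.deleted = deleted; this.msg = msg;
    }
    String key() { return ref + "\u0000" + updateIndex; }
  }

  // tables.get(0) is the newest table in the stack.
  static List<LogRec> readLog(List<List<LogRec>> tables, String ref) {
    Set<String> seen = new HashSet<>();
    List<LogRec> out = new ArrayList<>();
    for (List<LogRec> table : tables) {
      for (LogRec r : table) {
        if (!r.ref.equals(ref) || !seen.add(r.key()))
          continue;                  // an older record for the same key is shadowed
        if (!r.deleted)
          out.add(r);                // a deletion record simply hides the entry
      }
    }
    return out;
  }
}
```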

> * Define an extension mechanism.
> * Define the SHA-1 → object mapping as an extension rather than as
> part of the main spec. My gut feeling is that it will never be
> implemented for git-core.

While the SHA-1 -> object mapping may never be implemented for
git-core, I'd still prefer to see it as an optional part of the file
specification, rather than a separately specified extension. IMHO the
extension mechanism in DIRC has made that format unnecessarily
complicated, and we've still had to rev the file through many
revisions.

> * Revise how the SHA-1 → object mapping works:
> * Merge the bottommost OBJ_INDEX node with the old OBJ nodes to
> form a new OBJ_LEAF node.
> * Allow the branching factor of each node to be specified
> independently (to allow the node sizes to be matched more closely to
> the preferred read sizes).

I'm not sure objects warrant this kind of complexity. The obj support
in reftable is nearly identical to the ref support. I have a
significant amount of code that is common between them. Your approach
has objects different enough from refs that they need their own code,
increasing complexity in both the writer and reader.


> I currently lean towards the opinion that we should store pseudorefs
> (like `FETCH_HEAD`, `MERGE_HEAD` *outside of* reftables, except for
> `HEAD` (which behaves more like a normal reference, which is
> considered for reachability, and for which we want to retain reflogs),
> which we should store *in* reftables.

I'm on the fence, and don't really have a strong opinion about where
we store the pseudorefs. Happy to keep them in $GIT_DIR, happy to have
them supported inside a reftable.


Re: reftable [v4]: new ref storage format

2017-08-03 Thread Michael Haggerty
I've revised the blockless reftable proposal to address some feedback:

* Don't omit `prefix_len` for the first ref and first child in a
block. It doesn't save much but makes the reader more complicated.
* Get rid of `symref_target` (the backlink from a reference to the
symref(s) that point at it). It is a solution to a problem that is
not worth solving.
* Get rid of the `SYMREF_PEELED` type. It is impossible to maintain
this field affordably without `symref_target`, but it seems fragile
and unnecessary anyway.
* Add a way to mark a reflog entry "deleted" without having to rewrite
everything. This is mostly meant to deal with `refs/stash`.
* Define an extension mechanism.
* Define the SHA-1 → object mapping as an extension rather than as
part of the main spec. My gut feeling is that it will never be
implemented for git-core.
* Revise how the SHA-1 → object mapping works:
  * Merge the bottommost OBJ_INDEX node with the old OBJ nodes to
    form a new OBJ_LEAF node.
  * Allow the branching factor of each node to be specified
    independently (to allow the node sizes to be matched more closely
    to the preferred read sizes).
* Rename some fields for clarity.

I currently lean towards the opinion that we should store pseudorefs
(like `FETCH_HEAD`, `MERGE_HEAD`) *outside of* reftables, except for
`HEAD` (which behaves more like a normal reference, which is
considered for reachability, and for which we want to retain reflogs),
which we should store *in* reftables.

I don't talk at all about how individual reftable files are stacked
together, because that is pretty much unchanged from Shawn's proposal.
The only thing I would suggest is to give the reftable files names
that indicate whether they include reference values, reflogs, or both,
so that readers can tell from the table of contents which files they
have to open.

The new proposal follows:

Definitions:

Varint encoding is similar to the unsigned varint encoding used for
protocol buffers; i.e., a little-endian base-128 format where the most
significant bit is used to indicate that another byte is coming.
Decoding is as follows:

```
val = *ptr & 0x7f
while (*ptr++ & 0x80) {
    val = (val << 7) | (*ptr & 0x7f)
}
```
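
For reference, a small sketch of an encoder/decoder pair that
round-trips with the loop above (this is only an illustration of the
decode rule as written, not part of the proposal):

```java
import java.util.Arrays;

// Sketch of an encoder/decoder pair matching the decode loop above: 7-bit
// groups with the continuation bit (0x80) set on every byte except the last,
// emitted in the same order the loop above consumes them.
class Varint {
  static byte[] encode(long val) {
    byte[] tmp = new byte[10];                    // 64 bits need at most 10 groups
    int i = tmp.length;
    tmp[--i] = (byte) (val & 0x7f);               // final byte: continuation bit clear
    while ((val >>>= 7) != 0)
      tmp[--i] = (byte) (0x80 | (val & 0x7f));
    return Arrays.copyOfRange(tmp, i, tmp.length);
  }

  static long decode(byte[] buf, int ptr) {       // mirrors the pseudocode above
    long val = buf[ptr] & 0x7f;
    while ((buf[ptr++] & 0x80) != 0)
      val = (val << 7) | (buf[ptr] & 0x7f);
    return val;
  }
}
```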

Strings are usually encoded using `varstr` encoding, which is the same
as how strings are encoded in protocol buffers. A `varstr` consists of
a `varint` length followed by the specified number of bytes holding
the contents of the string. `varstr` strings are not intrinsically
NUL-terminated.

The `chunk_length` fields encode the total size of the containing
chunk, including its `chunk_type` field.

```
varstr(s) = {
    varint(strlen(s))
    s
}
```

```
reftable = {
    header
    [chunk | padding]*
    footer
}

header = {
    'REFT'
    uint8( version_number = 1 )

    # The length of the whole header, in bytes:
    header_length : uint16

    metadata
}

footer = {
    'REFT'
    uint8( version_number = 1 )
    metadata

    # The length of the whole footer, in bytes:
    footer_length : uint16

    uint32(CRC-32 of previous part of footer)
}

metadata = {
    # If `contains_values` is false, this file *must not* contain any
    # reference values; i.e., `value_type` must always be `NO_VALUE`.
    contains_values : bool

    # If `contains_logs` is false, this file *must not* contain any
    # reflog entries; i.e., `log_type` must always be `NO_REFLOG`.
    contains_logs : bool

    # If `offsets_negative` is true, then all `*_offset` fields point
    # backwards in the file; i.e., the corresponding varint value is
    # negated before use.
    offsets_negative : bool

    # update_index is a counter that is incremented by one for each
    # atomic reference update affecting a stack of reftables (even if
    # the reference update changes multiple references). Counting
    # starts at 1 (0 is used as a special value). The following two
    # fields indicate that this reftable file covers update_indexes in
    # the range
    #
    #     min_update_index < update_index <= max_update_index
    #
    # with the possible exception of LOG_DELETED entries.

    max_update_index : uint64
    min_update_index : uint64

    # To accommodate systems that have to write files serially, the
    # following two entries can be zeroed out in the header to tell
    # the reader that it has to read the corresponding values from the
    # footer.

    # The file offset of the chunk containing the root of the
    # reference tree:
    ref_root_chunk_addr : uint64

    # This space can be used to add extensions to the file format. It
    # is included in `header_length` / `footer_length` to allow older
    # readers to skip over any extensions that they don't understand.
    extension*
}

extension = {
    # The name of the extension:
    extension_name : varstr

    # The address of the first block of data associated with the extension:
    extension_root_chunk_addr : uint64
}

chunk = {
    # The `chunk_type` determines how to interpret the payload, and
    #
```

Re: reftable [v4]: new ref storage format

2017-08-03 Thread Shawn Pearce
On Thu, Aug 3, 2017 at 11:38 AM, Michael Haggerty  wrote:
> On Tue, Aug 1, 2017 at 7:38 PM, Shawn Pearce  wrote:
>> On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty  
>> wrote:
>>> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce  wrote:
 On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  
 wrote:
> [...]
 A couple of other notes about your contrasting design:

...
>>> OBJS blocks can also be
>>> unbounded in size if very many references point at the same object,
>>> thought that is perhaps only a theoretical problem.
>>
>> Gah, I missed that in reftable. The block id pointer list could cause
>> a single object id to exceed what fits in a block, and that will cause
>> the writer to fail unless its caller sets the block size larger. I
>> basically assumed this overflow condition is very unlikely, as its not
>> common to have a huge number of refs pointing to the same object.
>
> Given what Peff pointed out, let's just leave this as a varint for OBJS 
> blocks.

We discussed this at $DAY_JOB yesterday. We realized that if an obj
block has that many ref pointers present, it may be more efficient for
a reader to scan all references instead of chasing those pointers
individually. Latest draft of reftable now omits the ref pointer list
in an obj block if it exceeds the obj block size, which only occurs
when a high proportion of the ref blocks contain that SHA-1.
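
The reader-side fallback is then straightforward; roughly (names are
illustrative, not the actual JGit API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reader-side fallback: when the obj record carries no block
// pointer list, fall back to scanning every ref block. Names are invented.
class ObjLookup {
  interface RefBlock { List<String> refsPointingAt(String objId); }
  interface Reader {
    long[] objBlockList(String objId);   // empty when the list was omitted
    Iterable<RefBlock> allRefBlocks();   // brute-force path
    RefBlock refBlockAt(long addr);
  }

  static List<String> refsFor(Reader r, String objId) {
    List<String> out = new ArrayList<>();
    long[] addrs = r.objBlockList(objId);
    if (addrs.length == 0) {
      for (RefBlock b : r.allRefBlocks())      // pointer list omitted: scan everything
        out.addAll(b.refsPointingAt(objId));
    } else {
      for (long a : addrs)                     // otherwise visit only the named blocks
        out.addAll(r.refBlockAt(a).refsPointingAt(objId));
    }
    return out;
  }
}
```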


Re: reftable [v4]: new ref storage format

2017-08-03 Thread Shawn Pearce
On Wed, Aug 2, 2017 at 1:28 PM, Jeff King  wrote:
> On Wed, Aug 02, 2017 at 12:50:39PM -0700, Junio C Hamano wrote:
>
>> With the traditional "packed-refs plus loose" layout, no matter how
>> many times a handful of selected busy refs are updated during the
>> day, you'd need to open at most two files to find out the current
>> value of a single ref (admittedly, the accessing of the second file,
>> after we realize that there is no loose one, would be very costly).
>> If you make a few commits on a topic branch A, then build a 100
>> commit series on top of another topic branch B, finding the current
>> value of A is still one open and read of refs/heads/A.
>>
>> With the reftable format, we'd need to open and read all 100
>> incremental transactions that touch branch B before realizing that
>> none of them talk about A, and read the next transaction file to
>> find the current value of A.  To keep this number low, we'd need
>> quite a frequent compaction.
>
> I think this is where compaction cleverness can come in.
>
> One relatively easy optimization would be to note when the most recent
> reftable contains a subset of the refs we are currently updating (and
> the obvious string-of-updates to a single ref falls under that), and do
> a "quick" compaction where we simply drop[1] that reftable in favor of
> ours. That compaction is essentially free, because we know those entries
> aren't valid anymore anyway.
>
> I'm actually not sure if this is a strict "drop", though, because of
> reflogs. If the reflog is written into the same file as the ref update,
> then you'd need to roll its entry into your new update, too. But see
> below anyway.
>
> For more complicated cases, there's some cleverness you can do with
> small on-the-fly compactions. Even if there are entries in the last few
> reftables that we're not currently updating, it's pretty cheap to roll a
> few of them up into our new reftable if it lets us drop some
> intermediate reftables. E.g., if we're writing a new reftable with a 4K
> block size but only have 100 bytes of new data, we're probably best to
> roll up a previous 500-byte reftable.
>
> That one's an obvious optimization because we know that the filesystem
> is going to make us spend 4K either way, so rounding up to that is
> generally free-ish.

Yes. I was trying to propose exactly this in the first draft of
reftable, but someone on list didn't like the idea of an update
transaction stalling to perform a compaction, so I took that text out
in later drafts.

What I had envisioned is exactly what you mention; an update of
refs/heads/B that is just going to overlay another small reftable that
also recently updated refs/heads/B should just replace that table
instead of pushing onto the stack. Or if the combined top of stack +
new table is under 4K, they should just combine together instead of
pushing a new table onto the stack.
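
Roughly, that update-time heuristic could look like this (a sketch
only, with made-up names and sizes; not the actual JGit
implementation):

```java
import java.util.List;

// Sketch of the update-time heuristic: replace a fully superseded top table,
// or fold a tiny top table into the new one, instead of growing the stack.
class StackUpdate {
  static final int BLOCK_SIZE = 4096;

  interface Table {
    long sizeInBytes();
    boolean refsSubsetOf(Table newer);  // every ref here is rewritten by `newer`
    Table mergeWith(Table newer);       // newer entries win
  }

  // `stack` is ordered oldest..newest.
  static void push(List<Table> stack, Table update) {
    if (!stack.isEmpty()) {
      Table top = stack.get(stack.size() - 1);
      if (top.refsSubsetOf(update)) {
        stack.remove(stack.size() - 1);   // drop: superseded (reflog caveat ignored here)
      } else if (top.sizeInBytes() + update.sizeInBytes() <= BLOCK_SIZE) {
        stack.remove(stack.size() - 1);   // combined result still fits in one block
        update = top.mergeWith(update);
      }
    }
    stack.add(update);
  }
}
```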


> What's less obvious is when we should roll up a bunch of 4K tables into
> one (let's say) 32K table.  I think there's a formula for doing this
> geometrically so that the amortized cost of writing stays under a
> certain bound (linear?). But I haven't thought it through (or looked it
> up); I was hoping Shawn had already done so and could dump that wisdom
> on us.

Sorry, I haven't done that. :)


>> We can just declare that reftable format is not for desktop clients
>> but for server implementations where frequent compaction would not
>> be an annoyance to the users, but I'd wish we do not have to.
>
> Yeah, I agree we should avoid that. I would love to eventually kill off
> the loose/packed backend (or at least make it historical and read-only,
> so that a reasonable response to people complaining about its races and
> lack of atomicity is "please upgrade to reftables").

I'd also really like reftable to be useful for "desktop clients". I'd
consider it something of a design failure if the format was horribly
unsuitable in some way to that use case. I think the small compactions
during updates will be essential to supporting the typical developer's
workflow.


Re: reftable [v4]: new ref storage format

2017-08-03 Thread Michael Haggerty
On Tue, Aug 1, 2017 at 7:38 PM, Shawn Pearce  wrote:
> On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty  wrote:
>> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce  wrote:
>>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  
>>> wrote:
 [...]
>>> A couple of other notes about your contrasting design:
>>>
 elif chunk_type == INDEX {
 chunk_length : varint
>>>
>>> Using a varint for the chunk length made for a complicated reader.
>>> JGit doesn't have the luxury of mmap to access the file, so we have to
>>> allocate a byte[] and read data from a file descriptor to do anything
>>> fancy like decoding a varint. For my experiment I wound up just
>>> hardcoding the IO to read 1k or 4k from whatever address.
>>>
>>> A "real" implementation would likely prefer to read a fixed width
>>> field here such that chunks have a 3 byte header (1 byte chunk_type, 2
>>> byte chunk_length), and then issue a second read to acquire the rest
>>> of the chunk. Given that encoding a chunk length of 1024 or 4096 both
>>> requires 2 bytes of varint, its always going to be 2 bytes in your
>>> design anyway. With the way chunks are scanned, I don't think you want
>>> chunks as large as 16k, which would have caused the varint to go to 3
>>> bytes (but still fits in a fixed 2-byte chunk_length).
>>
>> That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS
>> blocks that include reflogs, the block size has to be large enough to
>> hold the whole reflog for a reference, which can be arbitrarily large.
>> (Maybe this is a weakness of the design?)
>
> Gah. I missed that part about the whole reflog needing to fit in the
> same chunk when I wrote the quoted text above. That is a pretty big
> downside for very busy refs.
>
> Even when you isolate logs into their own file, the reflog for a busy
> ref could be huge. A naive reader would want to "page in" the entire
> chunk_length before parsing. That isn't tenable if the reflog for a
> busy ref was say 50 MiB. It complicates the reader more. My reftable
> proposal deals with this by breaking the log up with its special key
> structure.
>
> Requiring the entire reflog of a single ref to fit into a single chunk
> does seem to have its downsides.

I was assuming that readers would uncompress the data in a streaming
fashion, in which case I don't think it would be much of a problem:
reflogs are usually read either in their entirety or only for their
most recent few entries, either of which can be done efficiently even
though the whole reflog is in a single zlib-compressed blob. If
streaming reads are thought to be too complicated for readers, it
wouldn't be a big deal to add a `log_type` of `REFLOG_COMPRESSED_SEGMENTED`
and put the entries into multiple, smaller zlib-compressed chunks.
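
For example, a reader using plain `java.util.zip` can inflate lazily
and stop after the first few entries (assuming, purely for
illustration, one reflog entry per line with the newest first):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.InflaterInputStream;

// Sketch of a "streamy" reader: inflate the zlib-compressed reflog lazily and
// stop after the newest few entries, without inflating the whole blob.
// Assumes, for illustration only, one entry per line with the newest first.
class StreamyReflog {
  static List<String> readNewest(byte[] compressed, int wanted) throws IOException {
    List<String> entries = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        new InflaterInputStream(new ByteArrayInputStream(compressed)),
        StandardCharsets.UTF_8))) {
      String line;
      while (entries.size() < wanted && (line = r.readLine()) != null)
        entries.add(line);    // only this prefix of the blob ever gets inflated
    }
    return entries;
  }
}
```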

>> OBJS blocks can also be
>> unbounded in size if very many references point at the same object,
>> thought that is perhaps only a theoretical problem.
>
> Gah, I missed that in reftable. The block id pointer list could cause
> a single object id to exceed what fits in a block, and that will cause
> the writer to fail unless its caller sets the block size larger. I
> basically assumed this overflow condition is very unlikely, as its not
> common to have a huge number of refs pointing to the same object.

Given what Peff pointed out, let's just leave this as a varint for OBJS blocks.

 elif chunk_type == OBJS_INDEX {
 chunk_length : varint

 # The offset, relative to the start of this chunk, of the
 # chunk containing the next level of the obj index, for each
 # of the possible "next" bytes in the SHA-1, or zero if there
 # are no references with the given next byte.
 child_offset : varint * 256
>>>
>>> This is space saving and cute, but kind of annoying. If it was fixed
>>> width 32 bit you can address up to 4G away from this chunk's address,
>>> and you can directly jump to the byte of interest. By being varints
>>> you do save a little space, as most files will probably only need 3
>>> byte varints, and the 0s do collapse down to 1 byte, but you have to
>>> linearly walk the list to find any specific byte.
>>
>> The reason I used varint here was mostly because I expect the lowest
>> level of the OBJS_INDEX to be placed close to the OBJS chunks that it
>> refers to; hopefully within a 2-byte varint. Let me consider more
>> carefully whether that is realistic...
>
> Ah, ok. So you were expecting the writer to interleave OBJS and
> OBJS_INDEX chunks, with the OBJS_INDEX appearing every ~256 OBJS
> chunks. And then storing the next level of OBJS_INDEX consecutively.
>
> I think your math is still wrong about the lowest level OBJS_INDEX
> needing only 2-byte varint for child_offset. Given a 1k chunk size,
> you still have to address backwards about 256k, which requires a
> 3-byte varint.

I think that you would want 

Re: reftable [v4]: new ref storage format

2017-08-02 Thread Junio C Hamano
Shawn Pearce  writes:

> On Wed, Aug 2, 2017 at 6:50 PM, Junio C Hamano  wrote:
> ...
>>Would it benefit us if we define the sort order of bytes slightly
>>different from the ASCII order, so that a slash '/' sorts between
>>NUL '\000' and SOH '\001', which is the order we should have used
>>when storing the entries in the index?
>
> I'm not really with that. It complicates the compare routine, and
> makes the data in reftable sorted differently than we announce in the
> wire protocol.

Fair enough.  It was not like I had some operations in mind that
would benefit from such a sort order (i.e. walking two of these
things in parallel and merging them, which would have been the case
for the index when we walk it together with one or more trees); if
there is no such operation that benefits, there is no reason to try
to be clever here.

>>  * Even though readers can continue accessing, starting from the
>>$GIT_DIR/refs, without locking and get consistent views, any
>>transaction that groups one or more ref updates would need to
>>take a global lock on $GIT_DIR/refs file.
>>
>>Would it become a problem in a busy repository?
> ...
>
> I'm not really sure how one could safely do the same thing with
> git-core. Its certainly not going to be portable or safe on NFS if we
> tried to do anything fancy with flock(2), fcntl(F_SETLKW), or
> semop(2).

Yes.

And for public record, another thing we privately discussed was that
we currently do not know if we would want to make this design mesh
well with the use of multiple worktrees (IIUC, HEAD and things
outside refs/, and refs/bisect/, need to be per-worktree, while
others are to be common), and if so how.


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Shawn Pearce
On Wed, Aug 2, 2017 at 6:50 PM, Junio C Hamano  wrote:
> Junio C Hamano  writes:
>
>> I like the general idea, what the file format can represent and how
>> it does so, but I am a bit uneasy about how well this "stacked" part
>> would work for desktop clients.
>
> Two more random things before I forget.
>
>  * I understand that you would want to allow both a ref "ref/D" and
>"ref/D/F" to appear in the same reftable file.  A refname is an
>uninterpreted sequence of bytes and refnames are sorted in the
>table.
>
>Would it benefit us if we define the sort order of bytes slightly
>different from the ASCII order, so that a slash '/' sorts between
>NUL '\000' and SOH '\001', which is the order we should have used
>when storing the entries in the index?

I'm not really with that. It complicates the compare routine, and
makes the data in reftable sorted differently than we announce in the
wire protocol. That cuts off any sort of optimizations we were
considering at $DAY_JOB to plumb the wire protocol code in JGit closer
to reftable code so that a large advertisement is more or less just
dumped straight from reftable to the wire protocol with only minimal
formatting changes.

>  * Even though readers can continue accessing, starting from the
>$GIT_DIR/refs, without locking and get consistent views, any
>transaction that groups one or more ref updates would need to
>take a global lock on $GIT_DIR/refs file.
>
>Would it become a problem in a busy repository?

Potentially yes. Writers may need to use a randomized exponential
backoff while trying to acquire the lock. For big application servers,
I recommended wrapping the file IO locking code with an application
server managed lock + fair wait queue. This is fairly straightforward
for JGit to implement within Gerrit Code Review.
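
A minimal sketch of such a backoff loop (the lock-file type and the
limits here are illustrative only):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of randomized exponential backoff around the stack's lock file.
// The LockFile interface and the caps used here are illustrative only.
class RefsLock {
  interface LockFile { boolean tryAcquire(); }

  static boolean lockWithBackoff(LockFile lock, int maxTries) throws InterruptedException {
    long capMs = 1;
    for (int attempt = 0; attempt < maxTries; attempt++) {
      if (lock.tryAcquire())
        return true;
      // Sleep a random interval up to the current cap, then double the cap.
      Thread.sleep(ThreadLocalRandom.current().nextLong(capMs + 1));
      capMs = Math.min(capMs * 2, 1000);
    }
    return false;
  }
}
```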

I'm not really sure how one could safely do the same thing with
git-core. It's certainly not going to be portable or safe on NFS if we
tried to do anything fancy with flock(2), fcntl(F_SETLKW), or
semop(2).


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Junio C Hamano
Junio C Hamano  writes:

> I like the general idea, what the file format can represent and how
> it does so, but I am a bit uneasy about how well this "stacked" part
> would work for desktop clients.

Two more random things before I forget.

 * I understand that you would want to allow both a ref "ref/D" and
   "ref/D/F" to appear in the same reftable file.  A refname is an
   uninterpreted sequence of bytes and refnames are sorted in the
   table.

   Would it benefit us if we define the sort order of bytes slightly
   different from the ASCII order, so that a slash '/' sorts between
   NUL '\000' and SOH '\001', which is the order we should have used
   when storing the entries in the index?

 * Even though readers can continue accessing, starting from the
   $GIT_DIR/refs, without locking and get consistent views, any
   transaction that groups one or more ref updates would need to
   take a global lock on $GIT_DIR/refs file.  

   Would it become a problem in a busy repository?


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Jeff King
On Wed, Aug 02, 2017 at 12:50:39PM -0700, Junio C Hamano wrote:

> With the traditional "packed-refs plus loose" layout, no matter how
> many times a handful of selected busy refs are updated during the
> day, you'd need to open at most two files to find out the current
> value of a single ref (admittedly, the accessing of the second file,
> after we realize that there is no loose one, would be very costly).
> If you make a few commits on a topic branch A, then build a 100
> commit series on top of another topic branch B, finding the current
> value of A is still one open and read of refs/heads/A.
> 
> With the reftable format, we'd need to open and read all 100
> incremental transactions that touch branch B before realizing that
> none of them talk about A, and read the next transaction file to
> find the current value of A.  To keep this number low, we'd need
> quite a frequent compaction.

I think this is where compaction cleverness can come in.

One relatively easy optimization would be to note when the most recent
reftable contains a subset of the refs we are currently updating (and
the obvious string-of-updates to a single ref falls under that), and do
a "quick" compaction where we simply drop[1] that reftable in favor of
ours. That compaction is essentially free, because we know those entries
aren't valid anymore anyway.

I'm actually not sure if this is a strict "drop", though, because of
reflogs. If the reflog is written into the same file as the ref update,
then you'd need to roll its entry into your new update, too. But see
below anyway.

For more complicated cases, there's some cleverness you can do with
small on-the-fly compactions. Even if there are entries in the last few
reftables that we're not currently updating, it's pretty cheap to roll a
few of them up into our new reftable if it lets us drop some
intermediate reftables. E.g., if we're writing a new reftable with a 4K
block size but only have 100 bytes of new data, we're probably best to
roll up a previous 500-byte reftable.

That one's an obvious optimization because we know that the filesystem
is going to make us spend 4K either way, so rounding up to that is
generally free-ish.

What's less obvious is when we should roll up a bunch of 4K tables into
one (let's say) 32K table.  I think there's a formula for doing this
geometrically so that the amortized cost of writing stays under a
certain bound (linear?). But I haven't thought it through (or looked it
up); I was hoping Shawn had already done so and could dump that wisdom
on us.
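
One common shape for such a rule (a sketch only, not something either
proposal specifies): after appending a new table, keep merging the
newest tables whenever the newest is at least half the size of the
table below it, so the stack of sizes stays roughly geometric and each
entry is rewritten only O(log n) times over its life.

```java
import java.util.List;

// Sketch of a geometric roll-up rule: after pushing a new table, keep merging
// the two newest tables while the newest is at least half the size of the one
// below it, so table sizes stay roughly geometric. Names are invented.
class GeometricCompaction {
  interface Table {
    long sizeInBytes();
    Table mergeWith(Table newer);   // newer entries win
  }

  // `stack` is ordered oldest..newest.
  static void maintain(List<Table> stack) {
    while (stack.size() >= 2) {
      Table newest = stack.get(stack.size() - 1);
      Table below = stack.get(stack.size() - 2);
      if (newest.sizeInBytes() * 2 < below.sizeInBytes())
        break;                                  // sizes already geometric; stop
      stack.remove(stack.size() - 1);
      stack.remove(stack.size() - 1);
      stack.add(below.mergeWith(newest));       // roll the two newest together
    }
  }
}
```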

> We can just declare that reftable format is not for desktop clients
> but for server implementations where frequent compaction would not
> be an annoyance to the users, but I'd wish we do not have to.

Yeah, I agree we should avoid that. I would love to eventually kill off
the loose/packed backend (or at least make it historical and read-only,
so that a reasonable response to people complaining about its races and
lack of atomicity is "please upgrade to reftables").

-Peff


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Stefan Beller
> ### Ref block format
>
> A ref block is written as:
>
> 'r'
> uint24( block_len )
> ref_record+
> uint32( restart_offset )+
> uint16( restart_count )
> padding?
>

So I learned that your current writer makes two passes over a block,
i.e. the block is first written into memory and then, once the block
looks complete, it is written out to disk.

This would allow us to shuffle the data around during the actual
out-to-disk phase, for example like this:

  'r'
  uint24( restart_count )
  uint32( restart_offset )+
  ref_record+
  ref_record_endmarker
  padding?

(A) In nearby emails we discussed making the restart offsets 24 bit,
but now they are 32-bit aligned to the start of a block so we could
keep them 32 bit for simplicity of reading.

(B) Note how there is no block_len encoding, which was originally
only needed to look up the position of restart_count. (So even for
that we could rename it to padding_len, such that the position of
restart_count can be decoded easily.)

We no longer need the block_len as the restart_count comes right
after the 'r'.

Instead we'll have a ref_record_endmarker that reads as a ref
with both prefix and suffix set to '0', type deletion (such that there is
no further cost). The end marker would only need two '0's, which
makes it indistinguishable from padding.


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Junio C Hamano
Shawn Pearce  writes:

> ### Layout
>
> The `$GIT_DIR/refs` path is a file when reftable is configured, not a
> directory.  This prevents loose references from being stored.
>
> A collection of reftable files are stored in the `$GIT_DIR/reftable/`
> directory:
>
> 0001_UF4paF
> 0002_bUVgy4
>
> where reftable files are named by a unique name such as produced by
> the function:
>
> mktemp "${update_index}_XX"
>
> The stack ordering file is `$GIT_DIR/refs` and lists the current
> files, one per line, in order, from oldest (base) to newest (most
> recent):
>
> $ cat .git/refs
> 0001_UF4paF
> 0002_bUVgy4
>
> Readers must read `$GIT_DIR/refs` to determine which files are
> relevant right now, and search through the stack in reverse order
> (last reftable is examined first).
>
> Reftable files not listed in `refs` may be new (and about to be added
> to the stack by the active writer), or ancient and ready to be pruned.

I like the general idea, what the file format can represent and how
it does so, but I am a bit uneasy about how well this "stacked" part
would work for desktop clients.  The structure presented here is for
optimizing the "we want to learn about many (or all) refs" access
pattern, which probably matters a lot on the server implementations,
but I do not feel comfortable without knowing how much it penalizes
"I want the current value of this single ref" access pattern.

With the traditional "packed-refs plus loose" layout, no matter how
many times a handful of selected busy refs are updated during the
day, you'd need to open at most two files to find out the current
value of a single ref (admittedly, the accessing of the second file,
after we realize that there is no loose one, would be very costly).
If you make a few commits on a topic branch A, then build a 100
commit series on top of another topic branch B, finding the current
value of A is still one open and read of refs/heads/A.

With the reftable format, we'd need to open and read all 100
incremental transactions that touch branch B before realizing that
none of them talk about A, and read the next transaction file to
find the current value of A.  To keep this number low, we'd need
quite a frequent compaction.

We can just declare that reftable format is not for desktop clients
but for server implementations where frequent compaction would not
be an annoyance to the users, but I'd wish we do not have to.





Re: reftable [v4]: new ref storage format

2017-08-02 Thread Jeff King
On Wed, Aug 02, 2017 at 08:17:29AM -0700, Shawn Pearce wrote:

> > Just peeking at torvalds/linux, we have some objects with ~35K refs
> > pointing to them (e.g., the v2.6.11 tag).
> 
> Oy. I'll bet that every occurrence winds up in its own block due to
> the layout of the namespace, and so the obj block list needs 35k
> varint pointers. That requires a larger block size if it has any
> chance of fitting into the reftable format.
> 
> Another option is disable the obj table for these shared-object repos.
> Its an optional part of the format and can be omitted if the reader
> isn't likely to need to lookup by SHA-1, or is willing to pay the
> brute force cost of scanning every ref.

Yeah, sorry, I meant to write a few more paragraphs. I think refusing to
generate the object table for these repos would be OK. We don't serve
any user-facing operations out of them directly[1].

I'm also open to the argument that they're simply insane. Most of the
time we don't need them to be a real repository at all. They could exist
as a bare "objects/" directory. It's only at repack time that we
actually need to know which objects are reachable[2], so we could
do a special repack that generates the list on the fly from the child
repositories.

-Peff

[1] We actually disable the ".have" advertisements, because the cost of
showing all of the shared-storage ref tips is too high. One thing
I'd like to do is be able to advertise a subset of the alternate
refs (if you're a fork of torvalds/linux, then share _just_ the refs
from there). But with the current ref code, I can't even ask for a
subset of the refs without paying the cost to walk all of them.
That's one of the things I'd like to build on top of the mmap'd
packed-refs solution (and naturally would work with reftables, too).

[2] It's a bit more complicated than just knowing the list of reachable
objects. We also want to know which ones are reachable from which
fork, as we do some trickery to avoid creating deltas across forks.
So we really do want the whole ref-list, and not something like a
de-duped set of reachable tips. I don't think that makes a
difference for anything we're discussing here, but just a bit of
trivia in case somebody is thinking about the shared-storage problem
space.


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Jeff King
On Wed, Aug 02, 2017 at 08:20:44AM -0400, Dave Borowitz wrote:

> >> OTOH a mythical protocol v2 might reduce the need to scan the
> >> references for advertisement, so maybe this optimization will be more
> >> helpful in the future?
> 
> I haven't been following the status of the proposal, but I was
> assuming a client-speaks-first protocol would also imply the client
> asking for refnames, not SHA-1s, in which case lookup by SHA-1 is no
> longer relevant.

Good point. The hidden-refs thing Shawn described is a trick that would
be used because the current protocol is so lousy. It's not clear how a
stateless-rpc request would work, but in theory the follow-up requests
could also say "hey, I'm only interested in refs/{heads,tags}" and the
stateless server could limit its marking to that.

But that would still leave something like allowReachableSHA1InWant
having to look at all refs (and I don't see why a client requesting by
sha1 would give any limiting refspec at all). On the other hand, looking
at the ref tips would probably be dominated by actually traversing the
commits in most cases. Of course one could come up with a pathological
case pretty easily (tons of refs, and people asking for submodules at
tip commits).

So I do think there are cases where the optimization would help, but
I'm still not sure how much. If it's an optional bit of the design, we
at least have the option of just not generating it if it turns out not
to be useful. My only concern would be if we make other protocol
sacrifices or
complications to include it.

-Peff


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Junio C Hamano
Shawn Pearce  writes:

> On Wed, Aug 2, 2017 at 2:28 AM, Jeff King  wrote:
>> On Tue, Aug 01, 2017 at 07:38:37PM -0700, Shawn Pearce wrote:
>>
>>> > OBJS blocks can also be
>>> > unbounded in size if very many references point at the same object,
>>> > thought that is perhaps only a theoretical problem.
>>>
>>> Gah, I missed that in reftable. The block id pointer list could cause
>>> a single object id to exceed what fits in a block, and that will cause
>>> the writer to fail unless its caller sets the block size larger. I
>>> basically assumed this overflow condition is very unlikely, as its not
>>> common to have a huge number of refs pointing to the same object.
>>
>> It's actually quite common for us, as we have big shared-object repos
>> that contain a copy of the refs of all of their child repos (for
>> reachability during packing, etc). So tags, where the value is the same
>> in each fork, you have one ref per fork pointing to it.
>>
>> Just peeking at torvalds/linux, we have some objects with ~35K refs
>> pointing to them (e.g., the v2.6.11 tag).
>
> Oy. I'll bet that every occurrence winds up in its own block due to
> the layout of the namespace, and so the obj block list needs 35k
> varint pointers. That requires a larger block size if it has any
> chance of fitting into the reftable format.
>
> Another option is disable the obj table for these shared-object repos.
> Its an optional part of the format and can be omitted if the reader
> isn't likely to need to lookup by SHA-1, or is willing to pay the
> brute force cost of scanning every ref.

I am wondering if we need the reverse look-up for a regular
repository that allows "fetch anything at the tip".  It only needs
"I got this request for an object name--does it sit at the tip of
any ref?  Yes/No".  It does not need to know exactly which ref
points at the asked object.

Yes, I know Gerrit has its own ACL that limits the visibility of
refs so the question becomes "does it sit at the tip of any ref that
is visible to the current user?".  A Yes/No answer for "any ref?"
may still give you a short-cut when rejecting, but you'd then need
to scan to give a positive answer without a full reverse mapping.

For the use of "consolidated object store for all forks", I do not
think it makes sense to even have "Does it sit at a tip, Yes/No"
information.  Even when Linus's repository and a fork share the same
object store, I do not think anybody expects to be able to fetch a
commit at the tip of a branch in the fork but not in Linus's
repository.



Re: reftable [v4]: new ref storage format

2017-08-02 Thread Shawn Pearce
On Wed, Aug 2, 2017 at 2:28 AM, Jeff King  wrote:
> On Tue, Aug 01, 2017 at 07:38:37PM -0700, Shawn Pearce wrote:
>
>> > OBJS blocks can also be
>> > unbounded in size if very many references point at the same object,
>> > thought that is perhaps only a theoretical problem.
>>
>> Gah, I missed that in reftable. The block id pointer list could cause
>> a single object id to exceed what fits in a block, and that will cause
>> the writer to fail unless its caller sets the block size larger. I
>> basically assumed this overflow condition is very unlikely, as its not
>> common to have a huge number of refs pointing to the same object.
>
> It's actually quite common for us, as we have big shared-object repos
> that contain a copy of the refs of all of their child repos (for
> reachability during packing, etc). So tags, where the value is the same
> in each fork, you have one ref per fork pointing to it.
>
> Just peeking at torvalds/linux, we have some objects with ~35K refs
> pointing to them (e.g., the v2.6.11 tag).

Oy. I'll bet that every occurrence winds up in its own block due to
the layout of the namespace, and so the obj block list needs 35k
varint pointers. That requires a larger block size if it has any
chance of fitting into the reftable format.

Another option is to disable the obj table for these shared-object repos.
It's an optional part of the format and can be omitted if the reader
isn't likely to need to look up by SHA-1, or is willing to pay the
brute force cost of scanning every ref.


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Dave Borowitz
On Tue, Aug 1, 2017 at 10:38 PM, Shawn Pearce  wrote:
>> Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
>> so important in the first place. Currently, all references must be
>> scanned for the advertisement anyway,
>
> Not really. You can hide refs and allow-tip-sha1 so clients can fetch
> a ref even if it wasn't in the advertisement. We really want to use
> that wire protocol capability with Gerrit Code Review to hide the
> refs/changes/ namespace from the advertisement, but allow clients to
> fetch any of those refs if they send its current SHA-1 in a want line
> anyway.
>
> So a server could scan only the refs/{heads,tags}/ prefixes for the
> advertisement, and then leverage the lookup-by-SHA1 to verify other
> SHA-1s sent by the client.
>
>> so avoiding a second scan to vet
>> SHA-1s received from the client is at best going to reduce the effort
>> by a constant factor. Do you have numbers showing that this
>> optimization is worth it?
>
> No, but I don't think I need to do much to prove it. My 866k ref
> example advertisement right now is >62 MiB. If we do what I'm
> suggesting in the paragraphs above, the advertisement is ~51 KiB.

That being said, our bias towards minimizing the number of ref scans
is rooted in our experience where scanning 866k refs takes 5 seconds
to get the response from the storage backend into the git server.
Cutting ref scans from 2 to 1 (or 1 to 0) is a big deal in that case.
But that 5s number is based on our current, slow storage, not on
reftable. If migrating to reftable turns each 5s scan into a 400ms
scan, we might be able to live with that, even if we don't have fast
lookup by SHA-1.

>> OTOH a mythical protocol v2 might reduce the need to scan the
>> references for advertisement, so maybe this optimization will be more
>> helpful in the future?

I haven't been following the status of the proposal, but I was
assuming a client-speaks-first protocol would also imply the client
asking for refnames, not SHA-1s, in which case lookup by SHA-1 is no
longer relevant.


Re: reftable [v4]: new ref storage format

2017-08-02 Thread Jeff King
On Tue, Aug 01, 2017 at 07:38:37PM -0700, Shawn Pearce wrote:

> > OBJS blocks can also be
> > unbounded in size if very many references point at the same object,
> > thought that is perhaps only a theoretical problem.
> 
> Gah, I missed that in reftable. The block id pointer list could cause
> a single object id to exceed what fits in a block, and that will cause
> the writer to fail unless its caller sets the block size larger. I
> basically assumed this overflow condition is very unlikely, as its not
> common to have a huge number of refs pointing to the same object.

It's actually quite common for us, as we have big shared-object repos
that contain a copy of the refs of all of their child repos (for
reachability during packing, etc). So tags, where the value is the same
in each fork, you have one ref per fork pointing to it.

Just peeking at torvalds/linux, we have some objects with ~35K refs
pointing to them (e.g., the v2.6.11 tag).

> > Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
> > so important in the first place. Currently, all references must be
> > scanned for the advertisement anyway,
> 
> Not really. You can hide refs and allow-tip-sha1 so clients can fetch
> a ref even if it wasn't in the advertisement. We really want to use
> that wire protocol capability with Gerrit Code Review to hide the
> refs/changes/ namespace from the advertisement, but allow clients to
> fetch any of those refs if they send its current SHA-1 in a want line
> anyway.
> 
> So a server could scan only the refs/{heads,tags}/ prefixes for the
> advertisement, and then leverage the lookup-by-SHA1 to verify other
> SHA-1s sent by the client.

Yeah, that makes sense (though I hope in the end that strategy will go
away in favor of a better protocol, as getting the sha1 out-of-band has
obvious UX complexities).

> > OTOH a mythical protocol v2 might reduce the need to scan the
> > references for advertisement, so maybe this optimization will be more
> > helpful in the future?
> 
> Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan
> Tan is doing, and switch the advertisement around to "client speaks
> first", so that we can be smarter on the server about which refs are
> read and sent. That is a long way off, lets say 3-5 years before its
> really common in clients.

I was actually planning to spend some time on this in the next month or
two. I don't think it needs to be that complicated. We don't need a
whole protocol revamp. We just need a way to get a few bits from the
client before the advertisement, and from there we can bootstrap any
more radical protocol changes we want.

I know it will take a while before it's something we can expect in
clients, but it's definitely worth planning around. And sometimes a
feature like this can drive upgrades, if it's something that produces an
immediate and obvious benefit to the client.

> Servers today don't update HEAD reflog when a branch is pushed. I
> think trying to record that is overkill, you have the reflog data in
> the ref itself that the client sent the command to modify.

I think they do, at least for C git:

  $ git init --bare dst.git
  $ git -C dst.git config core.logallrefupdates
  $ git push dst.git
  ...
  To dst.git
   * [new branch]  master -> master
  $ find dst.git/logs -type f | xargs wc -l
1 dst.git/logs/refs/heads/master
1 dst.git/logs/HEAD

The special logic for "see if we're updating the ref that HEAD points
to" is deep in the ref machinery, so it gets triggered for all updates,
including pushes.

I agree it's not actually that interesting for a bare repo, where HEAD
isn't that meaningful (and doesn't tend to change a lot anyway).

> > That's what I was thinking. But I've yet to hear anybody complain
> > about missing reflogs for symrefs if the underlying reference is
> > updated directly, so maybe we should just leave that "problem"
> > unsolved. It is certainly simpler and less brittle not to have to keep
> > backreferences like these in sync with the forward references.
> 
> Yup, that is my take on it, and why I didn't try to put this into
> reftable drafts, even though it was discussed between us on the list
> in earlier messages.

Yeah, I'd agree. It might be worth doing a better job of showing the
before/after destinations in the reflog when updating a symbolic ref,
which would let you reconstruct the state from the pointed-to reflogs if
you cared to. But that's orthogonal to the storage format (you can do it
already if you bother to pass a good message to "symbolic-ref -m").

-Peff


Re: reftable [v4]: new ref storage format

2017-08-01 Thread Shawn Pearce
On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty  wrote:
> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce  wrote:
>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  
>> wrote:
>>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
 4th iteration of the reftable storage format.
 [...]
>>>
>>> Before we commit to Shawn's reftable proposal, I wanted to explore
>>> what a contrasting design that is not block based would look like.
>>
>> I forgot to look at a 1k chunk size, as you suggested that might also
>> be suitable. Here is the more complete experiment table:
>>
>>| size   | seek_cold | seek_hot  |
>> mh  1k | 29.4 M | 20.6 usec | 10.7 usec |   <== Row fixed
>> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
>> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
>> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |
>
> At least in this case, the 1k block size seems like a good tradeoff.
> Did 1k vs 4k change the number of index levels required?

Yes. 1k needed 4 levels of index; 4k needed 2 levels.
More chunks, with smaller index chunks, forces a deeper index depth.
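
A back-of-the-envelope way to see why (the per-record and
per-index-entry sizes below are rough assumptions for illustration,
not measurements from the experiment):

```java
// Back-of-the-envelope sketch of index depth vs. block size. The bytes-per-
// record and bytes-per-index-entry figures are rough assumptions, not
// measurements; they merely illustrate the effect described above.
class IndexDepth {
  static long ceilDiv(long a, long b) { return (a + b - 1) / b; }

  static int levels(long refCount, int blockSize, int refBytes, int indexEntryBytes) {
    long refBlocks = ceilDiv(refCount * refBytes, blockSize);  // leaf ref blocks
    int fanout = blockSize / indexEntryBytes;                  // entries per index block
    int levels = 0;
    for (long blocks = refBlocks; blocks > 1; blocks = ceilDiv(blocks, fanout))
      levels++;                                                // one more index level
    return levels;
  }

  public static void main(String[] args) {
    // ~866k refs; assume ~35 bytes per ref record, ~40 bytes per index entry.
    System.out.println(levels(866_000, 1024, 35, 40));   // smaller blocks: deeper index
    System.out.println(levels(866_000, 4096, 35, 40));   // larger blocks: shallower index
  }
}
```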


>> A couple of other notes about your contrasting design:
>>
>>> elif chunk_type == INDEX {
>>> chunk_length : varint
>>
>> Using a varint for the chunk length made for a complicated reader.
>> JGit doesn't have the luxury of mmap to access the file, so we have to
>> allocate a byte[] and read data from a file descriptor to do anything
>> fancy like decoding a varint. For my experiment I wound up just
>> hardcoding the IO to read 1k or 4k from whatever address.
>>
>> A "real" implementation would likely prefer to read a fixed width
>> field here such that chunks have a 3 byte header (1 byte chunk_type, 2
>> byte chunk_length), and then issue a second read to acquire the rest
>> of the chunk. Given that encoding a chunk length of 1024 or 4096 both
>> requires 2 bytes of varint, its always going to be 2 bytes in your
>> design anyway. With the way chunks are scanned, I don't think you want
>> chunks as large as 16k, which would have caused the varint to go to 3
>> bytes (but still fits in a fixed 2-byte chunk_length).
>
> That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS
> blocks that include reflogs, the block size has to be large enough to
> hold the whole reflog for a reference, which can be arbitrarily large.
> (Maybe this is a weakness of the design?)

Gah. I missed that part about the whole reflog needing to fit in the
same chunk when I wrote the quoted text above. That is a pretty big
downside for very busy refs.

Even when you isolate logs into their own file, the reflog for a busy
ref could be huge. A naive reader would want to "page in" the entire
chunk_length before parsing. That isn't tenable if the reflog for a
busy ref was say 50 MiB. It complicates the reader more. My reftable
proposal deals with this by breaking the log up with its special key
structure.

Requiring the entire reflog of a single ref to fit into a single chunk
does seem to have its downsides.


> OBJS blocks can also be
> unbounded in size if very many references point at the same object,
> thought that is perhaps only a theoretical problem.

Gah, I missed that in reftable. The block id pointer list could cause
a single object id to exceed what fits in a block, and that will cause
the writer to fail unless its caller sets the block size larger. I
basically assumed this overflow condition is very unlikely, as it's not
common to have a huge number of refs pointing to the same object.


>>> elif chunk_type == OBJS_INDEX {
>>> chunk_length : varint
>>>
>>> # The offset, relative to the start of this chunk, of the
>>> # chunk containing the next level of the obj index, for each
>>> # of the possible "next" bytes in the SHA-1, or zero if there
>>> # are no references with the given next byte.
>>> child_offset : varint * 256
>>
>> This is space saving and cute, but kind of annoying. If it was fixed
>> width 32 bit you can address up to 4G away from this chunk's address,
>> and you can directly jump to the byte of interest. By being varints
>> you do save a little space, as most files will probably only need 3
>> byte varints, and the 0s do collapse down to 1 byte, but you have to
>> linearly walk the list to find any specific byte.
>
> The reason I used varint here was mostly because I expect the lowest
> level of the OBJS_INDEX to be placed close to the OBJS chunks that it
> refers to; hopefully within a 2-byte varint. Let me consider more
> carefully whether that is realistic...

Ah, ok. So you were expecting the writer to interleave OBJS and
OBJS_INDEX chunks, with the OBJS_INDEX appearing every ~256 OBJS
chunks. And then storing the next level of OBJS_INDEX consecutively.

I think your math is still wrong about the lowest level OBJS_INDEX
needing only 2-byte varint for child_offset. Given a 1k chunk size,
you still have to address backwards about 256k, which requires a
3-byte varint.

Re: reftable [v4]: new ref storage format

2017-08-01 Thread Michael Haggerty
On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce  wrote:
> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  
> wrote:
>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
>>> 4th iteration of the reftable storage format.
>>> [...]
>>
>> Before we commit to Shawn's reftable proposal, I wanted to explore
>> what a contrasting design that is not block based would look like.
>
> I forgot to look at a 1k chunk size, as you suggested that might also
> be suitable. Here is the more complete experiment table:
>
>| size   | seek_cold | seek_hot  |
> mh  1k | 29.4 M | 20.6 usec | 10.7 usec |   <== Row fixed
> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

At least in this case, the 1k block size seems like a good tradeoff.
Did 1k vs 4k change the number of index levels required?

> A couple of other notes about your contrasting design:
>
>> elif chunk_type == INDEX {
>> chunk_length : varint
>
> Using a varint for the chunk length made for a complicated reader.
> JGit doesn't have the luxury of mmap to access the file, so we have to
> allocate a byte[] and read data from a file descriptor to do anything
> fancy like decoding a varint. For my experiment I wound up just
> hardcoding the IO to read 1k or 4k from whatever address.
>
> A "real" implementation would likely prefer to read a fixed width
> field here such that chunks have a 3 byte header (1 byte chunk_type, 2
> byte chunk_length), and then issue a second read to acquire the rest
> of the chunk. Given that encoding a chunk length of 1024 or 4096 both
> requires 2 bytes of varint, its always going to be 2 bytes in your
> design anyway. With the way chunks are scanned, I don't think you want
> chunks as large as 16k, which would have caused the varint to go to 3
> bytes (but still fits in a fixed 2-byte chunk_length).

That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS
blocks that include reflogs, the block size has to be large enough to
hold the whole reflog for a reference, which can be arbitrarily large.
(Maybe this is a weakness of the design?) OBJS blocks can also be
unbounded in size if very many references point at the same object,
though that is perhaps only a theoretical problem.

> My reftable proposal should still do well in a mmap region. Most of
> the cold start penalty for reftable is JGit copying the ref index from
> the file descriptor to the memory block where we can parse the format.
> That is why the cold_seek time declines for a larger block size, the
> index is smaller.
>
>
>> first_child : {
>> refname : varstr
>> index_payload
>> }
>> other_children : {
>> # Length of prefix being carried over from the previous
>> # record:
>> prefix_len : varint
>> suffix : varstr
>> index_payload
>
> Having no prefix_len on first_child made for a slightly funkier
> parser. It does save you a byte, but the parser has to know if its
> looking at the first child, or an other_children to know if it should
> expect the prefix_len. Its a simple condition, but it kind of grated
> on me when I wrote that particular section of the experiment. For the
> majority of records the parser considers, the prefix_len is always
> present.
>
> That is why I proposed the restart_offsets point to the prefix_len,
> and prefix_len = 0 at restart points. It slightly simplified the
> parser.

Yes, that seems like a reasonable compromise.

>> elif chunk_type == OBJS_INDEX {
>> chunk_length : varint
>>
>> # The offset, relative to the start of this chunk, of the
>> # chunk containing the next level of the obj index, for each
>> # of the possible "next" bytes in the SHA-1, or zero if there
>> # are no references with the given next byte.
>> child_offset : varint * 256
>
> This is space saving and cute, but kind of annoying. If it was fixed
> width 32 bit you can address up to 4G away from this chunk's address,
> and you can directly jump to the byte of interest. By being varints
> you do save a little space, as most files will probably only need 3
> byte varints, and the 0s do collapse down to 1 byte, but you have to
> linearly walk the list to find any specific byte.

The reason I used varint here was mostly because I expect the lowest
level of the OBJS_INDEX to be placed close to the OBJS chunks that it
refers to; hopefully within a 2-byte varint. Let me consider more
carefully whether that is realistic...

An OBJS_INDEX should be well under 1kB in size. The lowest-level
OBJS_INDEX will presumably be relatively but not totally full
(otherwise you'd add another OBJS_INDEX level); suppose 1/2 of its
entries are filled. Most of those entries should contain only a single
SHA-1.

An obj_record will typically be 3 bytes plus the size of one

Re: reftable [v4]: new ref storage format

2017-08-01 Thread Michael Haggerty
On Tue, Aug 1, 2017 at 1:23 PM, Shawn Pearce  wrote:
> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  
> wrote:
>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
>>> 4th iteration of the reftable storage format.
>>> [...]
>>
>> Before we commit to Shawn's reftable proposal, I wanted to explore
>> what a contrasting design that is not block based would look like. So
>> I threw together a sketch of such a design. It is not as refined as
>> Shawn's, and I haven't had time to program an implementation or
>> collect statistics, but maybe it is interesting anyway.
>
> Thanks for taking the time to write this out. It's constructive to see
> other possible approaches.
>
> I roughly implemented your proposed design in JGit for references
> only. I skipped the object lookup and log handling for the sake of
> studying the basics of this approach.

Wow, that's awesome, thanks!

>| size   | seek_cold | seek_hot  |
> mh | 28.3 M | 24.5 usec | 14.5 usec |
> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

OK, so it's at least in the right ballpark.

>> * Multiple related chunks can be stored in a 64 kiB block (interesting
>>   for people who prefer to read files in 64k segments). By storing
>>   related chunks near each other, the number of seeks will probably
>>   typically be smaller than the number of chunks that have to be
>>   accessed.
>
> This is difficult. The most related chunks are actually index chunks
> in the multi-level index. You want the next step(s) near the current
> step to avoid an actual disk seek. But that is hard to do when the
> index is several levels deep.

Given the 64k read size that you prefer, it seems to me that it would
probably be possible to keep pairs of index layers (i.e., one index
node and all of its direct children) in the same 64k range, though
this might require adjusting their sizes somewhat. And possibly to
keep the lowest index layer close to the reference chunks that it
describes (analogous to your format, where the restart records form
something like an index that is located close to the references). So I
would think that it is plausible to arrange things so that a reference
can be read within two 64k reads even if the index has three levels.

>> * The file can be written in two possible ways: with cross-references
>>   between chunks always pointing to later parts of the file (probably
>>   preferable if the data will be read from a local hard disk) or
>>   always pointing to earlier parts of the file (to accommodate writers
>>   who need to write their data in order). See the `offsets_negative`
>>   field in the header. (I'm not sure that the former variant is
>>   actually needed.)
>
> This seems like unnecessary complexity. I'd prefer to just say the
> offsets are negative, and reference backwards.

Maybe. I just wasn't sure whether a hard disk would perform
significantly worse when reading backwards rather than forwards
(because of pre-fetch of blocks). It could very well be unnecessary.

Michael


Re: reftable [v4]: new ref storage format

2017-08-01 Thread Shawn Pearce
On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce  wrote:
> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  
> wrote:
>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
>>> 4th iteration of the reftable storage format.
>>> [...]
>>
>> Before we commit to Shawn's reftable proposal, I wanted to explore
>> what a contrasting design that is not block based would look like.
>
> I forgot to look at a 1k chunk size, as you suggested that might also
> be suitable. Here is the more complete experiment table:
>
>| size   | seek_cold | seek_hot  |
> mh  1k | 36.6 M | 20.6 usec | 10.7 usec |
> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

Argh. I got that mh 1k size wrong, it's actually 29.4M (not 36.6M!).
Sorry for the noise.


Re: reftable [v4]: new ref storage format

2017-08-01 Thread Shawn Pearce
On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  wrote:
> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
>> 4th iteration of the reftable storage format.
>> [...]
>
> Before we commit to Shawn's reftable proposal, I wanted to explore
> what a contrasting design that is not block based would look like.

I forgot to look at a 1k chunk size, as you suggested that might also
be suitable. Here is the more complete experiment table:

   | size   | seek_cold | seek_hot  |
mh  1k | 36.6 M | 20.6 usec | 10.7 usec |
mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
sp 64k | 27.7 M | 35.6 usec | 23.3 usec |


A couple of other notes about your contrasting design:

> elif chunk_type == INDEX {
> chunk_length : varint

Using a varint for the chunk length made for a complicated reader.
JGit doesn't have the luxury of mmap to access the file, so we have to
allocate a byte[] and read data from a file descriptor to do anything
fancy like decoding a varint. For my experiment I wound up just
hardcoding the IO to read 1k or 4k from whatever address.

A "real" implementation would likely prefer to read a fixed width
field here such that chunks have a 3 byte header (1 byte chunk_type, 2
byte chunk_length), and then issue a second read to acquire the rest
of the chunk. Given that encoding a chunk length of 1024 or 4096 both
requires 2 bytes of varint, it's always going to be 2 bytes in your
design anyway. With the way chunks are scanned, I don't think you want
chunks as large as 16k, which would have caused the varint to go to 3
bytes (but still fits in a fixed 2-byte chunk_length).
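
To make the fixed-width idea concrete, here is a minimal sketch of such
a reader (hypothetical names, not JGit's actual API; it assumes
chunk_length counts the whole chunk, including its 3-byte header):

    import java.io.EOFException;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    class ChunkReader {
      // One read for the 3-byte header (1-byte chunk_type, 2-byte
      // chunk_length in network byte order), then a second read for
      // the remainder of the chunk.
      static ByteBuffer readChunk(FileChannel ch, long pos) throws IOException {
        ByteBuffer hdr = ByteBuffer.allocate(3);
        readFully(ch, hdr, pos);
        int chunkType = hdr.get(0) & 0xff;  // parse the body based on this
        int chunkLen = ((hdr.get(1) & 0xff) << 8) | (hdr.get(2) & 0xff);
        ByteBuffer body = ByteBuffer.allocate(chunkLen - 3);
        readFully(ch, body, pos + 3);
        return body;
      }

      static void readFully(FileChannel ch, ByteBuffer b, long pos) throws IOException {
        while (b.hasRemaining()) {
          if (ch.read(b, pos + b.position()) < 0)
            throw new EOFException();
        }
        b.flip();
      }
    }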

My reftable proposal should still do well in a mmap region. Most of
the cold start penalty for reftable is JGit copying the ref index from
the file descriptor to the memory block where we can parse the format.
That is why the cold_seek time declines for a larger block size, the
index is smaller.


> first_child : {
> refname : varstr
> index_payload
> }
> other_children : {
> # Length of prefix being carried over from the previous
> # record:
> prefix_len : varint
> suffix : varstr
> index_payload

Having no prefix_len on first_child made for a slightly funkier
parser. It does save you a byte, but the parser has to know if it's
looking at the first child, or an other_children to know if it should
expect the prefix_len. It's a simple condition, but it kind of grated
on me when I wrote that particular section of the experiment. For the
majority of records the parser considers, the prefix_len is always
present.

That is why I proposed the restart_offsets point to the prefix_len,
and prefix_len = 0 at restart points. It slightly simplified the
parser.
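
For illustration, the uniform parse that convention allows looks
roughly like this (hypothetical names; readVarint() stands in for the
proposal's varint decoder, which is not shown here):

    // Every child record carries prefix_len; a restart point is simply
    // a record whose prefix_len is 0, so one code path handles both.
    // readVarint() is an assumed helper for the proposal's varint encoding.
    static byte[] readChildName(ByteBuffer in, byte[] priorName) {
      int prefixLen = readVarint(in);        // 0 at restart points
      byte[] suffix = readVarstr(in);        // varstr: varint length + bytes
      byte[] name = new byte[prefixLen + suffix.length];
      System.arraycopy(priorName, 0, name, 0, prefixLen);
      System.arraycopy(suffix, 0, name, prefixLen, suffix.length);
      return name;                           // index_payload follows in 'in'
    }

    static byte[] readVarstr(ByteBuffer in) {
      byte[] s = new byte[readVarint(in)];
      in.get(s);
      return s;
    }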


> elif chunk_type == OBJS_INDEX {
> chunk_length : varint
>
> # The offset, relative to the start of this chunk, of the
> # chunk containing the next level of the obj index, for each
> # of the possible "next" bytes in the SHA-1, or zero if there
> # are no references with the given next byte.
> child_offset : varint * 256

This is space saving and cute, but kind of annoying. If it was fixed
width 32 bit you can address up to 4G away from this chunk's address,
and you can directly jump to the byte of interest. By being varints
you do save a little space, as most files will probably only need 3
byte varints, and the 0s do collapse down to 1 byte, but you have to
linearly walk the list to find any specific byte.
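
For contrast, a fixed-width table lets the reader jump straight to the
entry for the next SHA-1 byte instead of decoding up to 256 varints
serially. A rough sketch, assuming a hypothetical layout of 256
big-endian uint32 offsets immediately after a 3-byte chunk header:

    // objsIndexChunk holds one OBJS_INDEX chunk starting at position 0;
    // nextSha1Byte is the next byte (0-255) of the SHA-1 being looked
    // up. A result of 0 means no references have that next byte.
    static long childOffset(ByteBuffer objsIndexChunk, int nextSha1Byte) {
      int entry = 3 + 4 * nextSha1Byte;      // skip the assumed 3-byte header
      return objsIndexChunk.getInt(entry) & 0xffffffffL;
    }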


> ref_payload = {
> value_type : enum NO_VALUE
> | DELETED
> | VALUE | VALUE_PEELED
> | SYMREF | SYMREF_PEELED
> | SPECIAL
> log_type : enum NO_REFLOG | REFLOG | REFLOG_COMPRESSED
> symref_target : bool

FWIW I didn't implement log_type or symref_target in my experiment, so
the size per ref was maybe a few bytes smaller than what you outlined
here.


> # This field is used to keep backwards links from references to
> # any symrefs that point at them, to make it tractable to update
> # the reflog of the symref if the reference is changed directly:
> if symref_target {
> referer : varstr
> varint(0)
> }

I wonder how desirable this feature is. Most updates are done through
HEAD, which is a symref and can therefore update both HEAD and the
target's reflogs in the same operation. It seems to me it's rare to
issue an update directly on the ref that HEAD points at. It's even
rarer to have a non-HEAD symbolic reference whose reflog you expect to
track something else.

Is this for refs/remotes/origin/HEAD to be a symref and have its
reflog mirror the fetch operations that touched the underlying ref?


Re: reftable [v4]: new ref storage format

2017-08-01 Thread Shawn Pearce
On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty  wrote:
> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
>> 4th iteration of the reftable storage format.
>> [...]
>
> Before we commit to Shawn's reftable proposal, I wanted to explore
> what a contrasting design that is not block based would look like. So
> I threw together a sketch of such a design. It is not as refined as
> Shawn's, and I haven't had time to program an implementation or
> collect statistics, but maybe it is interesting anyway.

Thanks for taking the time to write this out. It's constructive to see
other possible approaches.

I roughly implemented your proposed design in JGit for references
only. I skipped the object lookup and log handling for the sake of
studying the basics of this approach.

   | size   | seek_cold | seek_hot  |
mh | 28.3 M | 24.5 usec | 14.5 usec |
sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

All tests were run with a 4K chunk/block size on an 866k ref example.
The mh approach does not use padding between chunks, so chunks are
variable sized, but not exceeding 4K. The difference between cold and
hot is whether the file was held open and the root of the ref index
was kept in memory.

In the mh approach with a 4K chunk size and 866k refs, the index is 2
levels deep. A cold seek needs to perform 3 disk reads to navigate the
index, hot seek is 2 disk reads.


> It seems to me that the block-orientation of Shawn's spec adds a lot
> of complication to the design (e.g., padding, optional blocks, varying
> block sizes for different parts of the file). It also seems that the
> format has been heavily optimized to be read in 64k blocks, whereas
> most users will be storing their references to local hard disks or
> (increasingly) to local SSDs. So I tried to come up with a design that
> is friendlier to the latter, hopefully without being much worse for
> the former.

Yes, my design is biased to run well on a 64k block size, where the
underlying disk is slow. Ideally reads require only one or two block
reads. Torn blocks (where a range of data lies on two different
blocks) are expensive, as both 64k blocks have to be loaded to acquire
data that lies across the boundary. Given how fast SSDs are, I don't
think this causes any problems for SSD users.


> One thing that I realized is that, assuming that reference values and
> reflogs are usually compacted separately, there will basically be
> three types of reftable files:
>
> 1. Accumulated references and their values (potentially very large)
>
> 2. Accumulated reflogs (potentially gigantic)
>
> 3. Recent transaction(s), which include both refs and reflogs (usually
>quite small).
>
> The first two types of file don't benefit from separating reference
> data from reflog data in the file (because they only include one or
> the other). The third kind of file doesn't care because such files
> will be small.

In principle, I agree with your simplification. The important thing is
keeping the reflog away from the accumulated references, such that
scans of references (e.g. the current ls-remote advertisement) are
efficient. I think I was also aiming to do this even for the
transaction log files, as ls-remote operations don't need the log
record data.


> So let's store all of the data related to a single reference in one
> spot. This simplifies the file format and removes the need for two
> separate indexes and/or variants that special-case the omission of an
> index. A disadvantage is that, it prevents compression of reflog
> entries across references. I think that is a reasonable tradeoff,
> because most reflog entries (in the repositories I checked; that might
> differ in Gerrit-managed repos) correspond to references that have
> been updated many times. To partly compensate for the resulting
> reduction in compression, I added some features to reduce the storage
> of redundant information in reflog entries.

I found the encoding of reflog entries below somewhat complicated, but
I haven't been able to implement it to compare storage size vs. the
deflated block I proposed.


> On the other hand, this design does not support binary searching for
> refnames. Instead, the refnames are distributed into a multiple-level
> tree. When looking for a reference, each tree node on the way to the
> reference needs to be parsed serially. But the nodes are small enough
> (due to the multilevel indexing scheme) that I hope that isn't a
> problem. I'd rather read less data at the expense of parsing a little
> bit more.

Looking at the numbers above, the binary search within a 4k block does
reduce lookup time. E.g. reftable has ~7 binary search restart points
within each 4k block, vs. your approach needing to parse up to 119
refs in a 4k chunk.


> * Multiple related chunks can be stored in a 64 kiB block (interesting
>   for people who prefer to read files in 64k segments). By storing
>   related 

Re: reftable [v4]: new ref storage format

2017-08-01 Thread Shawn Pearce
On Mon, Jul 31, 2017 at 4:43 PM, Shawn Pearce  wrote:
> On Mon, Jul 31, 2017 at 12:42 PM, Junio C Hamano  wrote:
>>
>> As a block cannot be longer than 16MB, allocating uint32 to a
>> restart offset may be a bit overkill.  I do not know if it is worth
>> attempting to pack 1/3 more restart entries in a block by using
>> uint24, though.
>
> I don't think it's worth the annoyance in code that has to deal with
> this field. An example 87k ref repository with a 4k block size has ~9
> restarts per block and ~19 bytes of padding. Using uint24 saves us 9
> bytes, yet the average ref here costs 27 bytes. We aren't likely to
> fill the block with another ref, or another restart point.

I thought about this more. We can fit an additional ref per block in
some cases. It saves ~20 KiB for one of my example repositories if
restart_offset is a uint24.

Given that readers have to deal with these being unaligned loads
anyway, it's not significantly harder to work with uint24 vs. uint32.
So I've changed the definition to be uint24.
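
The load itself stays trivial either way; a tiny illustrative helper
(not JGit's API) for the narrower field:

    // Read a big-endian (network byte order) uint24 restart_offset at
    // an arbitrary, possibly unaligned byte offset within the block.
    static int getUInt24(byte[] buf, int off) {
      return ((buf[off] & 0xff) << 16)
           | ((buf[off + 1] & 0xff) << 8)
           |  (buf[off + 2] & 0xff);
    }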


Re: reftable [v4]: new ref storage format

2017-08-01 Thread Shawn Pearce
On Tue, Aug 1, 2017 at 6:54 AM, Dave Borowitz  wrote:
> On Sun, Jul 30, 2017 at 11:51 PM, Shawn Pearce  wrote:
>> - Ref-like files (FETCH_HEAD, MERGE_HEAD) also use type 0x3.
>
>> - Combine reflog storage with ref storage for small transactions.
>> - Separate reflog storage for base refs and historical logs.
>
> How is the stash implemented in reftable? In particular, "git stash
> drop" needs to be able to remove an arbitrary single entry from a
> reflog.
>
> I don't think the current proposal supports writing tombstones for
> reflog entries, so this maybe implies that "stash drop" would have to
> be implemented by rewriting the whole reflog for the stash ref during
> compaction. Can the current compaction algorithm support this?

Yes, the reftable holding the stash would have to be completely
rewritten for a "stash drop" command to be able to remove a reflog
entry.

> I suppose there are more exotic alternatives:
> * Go back to the normal ref(log) format for refs/stash. I figure you
> probably don't want to do this, given that you already moved other
> ref-like files into the reftable in a later revision of this proposal.
> * Implement the whole stash storage/command in some other way that
> doesn't depend on reflog.

I think the simplest is just to put refs/stash in its own reftable, and
put that reftable in the stack somewhere. Editing refs/stash reflog
requires rewriting and replacing that reftable in the stack, but
otherwise doesn't impact any other (possibly much larger) reflog.

Pushing things onto refs/stash would be creating new small transaction
reftables on the top of the stack, which should be compacted with the
lower refs/stash table. I guess that means refs/stash needs special
handling in some of the compaction code to keep it isolated.


Re: reftable [v4]: new ref storage format

2017-08-01 Thread Dave Borowitz
On Sun, Jul 30, 2017 at 11:51 PM, Shawn Pearce  wrote:
> - Ref-like files (FETCH_HEAD, MERGE_HEAD) also use type 0x3.

> - Combine reflog storage with ref storage for small transactions.
> - Separate reflog storage for base refs and historical logs.

How is the stash implemented in reftable? In particular, "git stash
drop" needs to be able to remove an arbitrary single entry from a
reflog.

I don't think the current proposal supports writing tombstones for
reflog entries, so this maybe implies that "stash drop" would have to
be implemented by rewriting the whole reflog for the stash ref during
compaction. Can the current compaction algorithm support this?

I suppose there are more exotic alternatives:
* Go back to the normal ref(log) format for refs/stash. I figure you
probably don't want to do this, given that you already moved other
ref-like files into the reftable in a later revision of this proposal.
* Implement the whole stash storage/command in some other way that
doesn't depend on reflog.


Re: reftable [v4]: new ref storage format

2017-08-01 Thread Michael Haggerty
On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
> 4th iteration of the reftable storage format.
> [...]

Before we commit to Shawn's reftable proposal, I wanted to explore
what a contrasting design that is not block based would look like. So
I threw together a sketch of such a design. It is not as refined as
Shawn's, and I haven't had time to program an implementation or
collect statistics, but maybe it is interesting anyway.

The design benefits from a lot of Shawn's ideas, plus a lot of the
discussion from the mailing list. First let me explain in words the
major differences and why I think they might be promising.

It seems to me that the block-orientation of Shawn's spec adds a lot
of complication to the design (e.g., padding, optional blocks, varying
block sizes for different parts of the file). It also seems that the
format has been heavily optimized to be read in 64k blocks, whereas
most users will be storing their references to local hard disks or
(increasingly) to local SSDs. So I tried to come up with a design that
is friendlier to the latter, hopefully without being much worse for
the former.

One thing that I realized is that, assuming that reference values and
reflogs are usually compacted separately, there will basically be
three types of reftable files:

1. Accumulated references and their values (potentially very large)

2. Accumulated reflogs (potentially gigantic)

3. Recent transaction(s), which include both refs and reflogs (usually
   quite small).

The first two types of file don't benefit from separating reference
data from reflog data in the file (because they only include one or
the other). The third kind of file doesn't care because such files
will be small.

So let's store all of the data related to a single reference in one
spot. This simplifies the file format and removes the need for two
separate indexes and/or variants that special-case the omission of an
index. A disadvantage is that it prevents compression of reflog
entries across references. I think that is a reasonable tradeoff,
because most reflog entries (in the repositories I checked; that might
differ in Gerrit-managed repos) correspond to references that have
been updated many times. To partly compensate for the resulting
reduction in compression, I added some features to reduce the storage
of redundant information in reflog entries.

I also wanted to support multiple-level indexes, so that one isn't
forced to choose rather large blocksizes for repositories with lots of
references.

On the other hand, this design does not support binary searching for
refnames. Instead, the refnames are distributed into a multiple-level
tree. When looking for a reference, each tree node on the way to the
reference needs to be parsed serially. But the nodes are small enough
(due to the multilevel indexing scheme) that I hope that isn't a
problem. I'd rather read less data at the expense of parsing a little
bit more.


The primary data structure consists of `INDEX` and `REFS` chunks.
Basically the `INDEX` chunks are interior nodes of the tree, and the
`REFS` chunks are the leaf nodes. But both types of nodes are N-way.

* Prefix compression occurs within a single node chunk, but not across
  nodes. In that way, the node chunks are analogous to a restart plus
  the subsequent prefix-compressed references in Shawn's design (but
  they will typically be bigger).

* A chunk must be read and processed serially to make sense of it.
  Chunks are also the unit of addressing.

* Indexes can be multi-level to keep each level of the tree small.
  This reduces the total amount of data that needs to be read to
  access a reference, but might require an additional seek if the
  required index chunks are not close together.

* Node chunks can be read serially starting at any chunk without
  referring to any index. Also, data are always written in order by
  reference name, both within single chunks and regarding the ordering
  of multiple `REFS` chunks. So references can be iterated over
  without regard to the indexes.

* I feel like the size of a chunk should be roughly 1-4 kiB, and will
  usually fit in one disk block (though it might cross disk blocks,
  requiring two to be read).

* Multiple related chunks can be stored in a 64 kiB block (interesting
  for people who prefer to read files in 64k segments). By storing
  related chunks near each other, the number of seeks will probably
  typically be smaller than the number of chunks that have to be
  accessed.

* Padding can be added between any two chunks (e.g., to avoid
  splitting chunks across 64 kiB blocks), but it is entirely optional
  and probably would not be used for data stored on local storage.

* In many cases, cross-references between chunks are relative, so that
  they can be stored compactly using varint encoding. This is more
  important than in Shawn's proposal because we don't have block
  indexes, so we have to store arbitrary disk offsets.

* The 

Re: reftable [v4]: new ref storage format

2017-07-31 Thread Shawn Pearce
On Mon, Jul 31, 2017 at 12:42 PM, Junio C Hamano  wrote:
> Shawn Pearce  writes:
>
>> ### Peeling
>>
>> References in a reftable are always peeled.
>
> This hopefully means "a record for an annotated (or signed) tag
> records both the tag object and the object it refers to", and does
> not include peeling a commit down to its tree.

Yes, thanks for the clarification. I've added that to this section.


>> ### Reference name encoding
>>
>> Reference names should be encoded with UTF-8.
>
> If we limited ourselves to say that a refname is an uninterpreted
> sequence of bytes that must pass check_refname_format() test, then
> we won't open ourselves to confusion such as "are two refs with the same
> Unicode name encoded with different normalization form considered
> the same?"

Ack. I'll reword this as bytes, and point to git-check-ref-format.

> In what way does this "should be in UTF-8" help describing the file
> format, I wonder.

In JGit we have a String object for a reference name. Dropping that
down to a sequence of bytes requires converting it with a character
encoding, such as US-ASCII or UTF-8.


>> ### First ref block
>>
>> The first ref block shares the same block as the file header, and is
>> 24 bytes smaller than all other blocks in the file.  The first block
>> immediately begins after the file header, at offset 24.
>>
>> If the first block is a log block (a log only table), its block header
>> begins immediately at offset 24.
>
> A minor nit: You called such a file "a log-only file"; let's be consistent.

Fixed. I found another that also was inconsistent. Thanks for the
attention to detail.


>> ### Ref block format
>>
>> A ref block is written as:
>>
>> 'r'
>> uint24( block_len )
>> ref_record+
>> uint32( restart_offset )+
>> uint16( restart_count )
>> padding?
>>
>> Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
>> encodes the number of bytes in the block up to, but not including the
>> optional `padding`.  This is almost always shorter than the file's
>> `block_size`.  In the first ref block, `block_len` includes 24 bytes
>> for the file header.
>
> As a block cannot be longer than 16MB, allocating uint32 to a
> restart offset may be a bit overkill.  I do not know if it is worth
> attempting to pack 1/3 more restart entries in a block by using
> uint24, though.

I don't think it's worth the annoyance in code that has to deal with
this field. An example 87k ref repository with a 4k block size has ~9
restarts per block and ~19 bytes of padding. Using uint24 saves us 9
bytes, yet the average ref here costs 27 bytes. We aren't likely to
fill the block with another ref, or another restart point.


>>  ref record
[...]
>> The `value` follows.  Its format is determined by `value_type`, one of
>> the following:
>>
>> - `0x0`: deletion; no value data (see transactions, below)
>> - `0x1`: one 20-byte object id; value of the ref
>> - `0x2`: two 20-byte object ids; value of the ref, peeled target
>> - `0x3`: symref and text: `varint( text_len ) text`
>> - `0x4`: index record (see below)
>> - `0x5`: log record (see below)
[...]
>> Types `0x6..0x7` are reserved for future use.
>
> I wondered if we regret the apparent limited extensibility later,
> but giving 4 bits to value-type would limit suffix length that can
> be represented by a single varint() only to 7, while the format
> described would give us up to 15 bytes.  We can say type 0x7 would
> be followed by another varint() to record the extended type, or
> something, to extend it, so probably what you did here strikes a
> good balance.

Thanks. I think we actually have 0x4-0x7 available in value_type. I
used 0x4 and 0x5 as above only as a suggestion from Stefan to help
someone debug a reader. The block type already differs between blocks
where 0x0-0x3 are used and those where 0x4 and 0x5 are used, so there
is sufficient information to disambiguate the value_type; we don't
actually have to use
0x4 and 0x5 as I have here.

So I agree with myself, and with you, 3 bits for value_type and 4 bits
for suffix_len is the right balance.


>> ### Ref index
>> ...
>> An index block should only be written if there are at least 4 blocks
>> in the file, as cold reads using the index requires 2 disk reads (read
>> index, read block), and binary searching <= 4 blocks also requires <=
>> 2 reads.  Omitting the index block from smaller files saves space.
>
> I think the last "<= 4" should be "< 4" here.  That is consistent
> with an earlier part of the paragraph that requires at least 4
> ref-blocks in the file, because a reftable with only 3 ref-blocks
> still can be accessed with 2 reads (a reftable with 4 ref-blocks
> without index may need 3 reads as there is no "middle" for binary
> search).
>
> The first sentence should be "if there are at least 4 ref-blocks", I
> guess.

Thanks, clarified.


>> ### Obj block format
>>
>> Object blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping
>> to ref 

Re: reftable [v4]: new ref storage format

2017-07-31 Thread Shawn Pearce
On Mon, Jul 31, 2017 at 12:01 PM, Stefan Beller  wrote:
> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
>> 4th iteration of the reftable storage format.
>>
>> You can read a rendered version of this here:
>> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md
>>
>
>> uint16( restart_count )
>
> When looking for 16/32/64 bit hard coded ints I came across
> this one once again. How much more expensive is reading
> a varint? As the block_len points at the restart_count, we cannot
> replace it with a varint easily, but we could use a byte-reversed
> varint instead. If we do this step, all restart offsets could also be
> (reverse) varints?

It's not the expense of decoding a varint; it's making the file
structure predictable so it's simple to binary search within restart
points. If these were varints, it would become much more difficult to
divide the remaining range in half and update the boundary conditions
to locate the next mid-point.
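
To illustrate what that predictability buys, here is a rough sketch of
the restart binary search against the v4 layout quoted earlier in this
thread (ref_records, then uint32 restart offsets, then a uint16
restart_count ending at block_len; a later message narrows the offsets
to uint24). refNameAt() is an assumed helper that decodes the full name
stored at a restart offset, where prefix_length is 0:

    import java.util.Arrays;

    class RestartSearch {
      // refNameAt(block, recOff) is assumed to decode the full ref name
      // of the record at a restart offset. Returns the restart index
      // from which a forward, record-by-record scan should start when
      // looking for 'want'.
      static int seekRestart(byte[] block, int blockLen, byte[] want) {
        int restartCount = ((block[blockLen - 2] & 0xff) << 8)
                         |  (block[blockLen - 1] & 0xff);
        int table = blockLen - 2 - 4 * restartCount;
        int lo = 0, hi = restartCount;
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          int recOff = readUInt32(block, table + 4 * mid);
          if (Arrays.compareUnsigned(refNameAt(block, recOff), want) < 0)
            lo = mid + 1;
          else
            hi = mid;
        }
        return Math.max(lo - 1, 0);  // scan forward from this restart
      }

      static int readUInt32(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
      }
    }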


Re: reftable [v4]: new ref storage format

2017-07-31 Thread Junio C Hamano
Shawn Pearce  writes:

> ### Peeling
>
> References in a reftable are always peeled.

This hopefully means "a record for an annotated (or signed) tag
records both the tag object and the object it refers to", and does
not include peeling a commit down to its tree.

> ### Reference name encoding
>
> Reference names should be encoded with UTF-8.

If we limited ourselves to say that a refname is an uninterpreted
sequence of bytes that must pass check_refname_format() test, then
we won't open ourselves to confusion such as "are two refs with the same
Unicode name encoded with different normalization form considered
the same?"

In what way does this "should be in UTF-8" help describing the file
format, I wonder.

> ### Directory/file conflicts
>
> The reftable format accepts both `refs/heads/foo` and
> `refs/heads/foo/bar` as distinct references.
>
> This property is useful for retaining log records in reftable, but may
> confuse versions of Git using `$GIT_DIR/refs` directory tree to
> maintain references.  Users of reftable may choose to continue to
> reject `foo` and `foo/bar` type conflicts to prevent problems for
> peers.

This is an interesting one.  I do agree that preserving reflog for
removed refs is a nice property to have.

> ### First ref block
>
> The first ref block shares the same block as the file header, and is
> 24 bytes smaller than all other blocks in the file.  The first block
> immediately begins after the file header, at offset 24.
>
> If the first block is a log block (a log only table), its block header
> begins immediately at offset 24.

A minor nit: You called such a file "a log-only file"; let's be consistent.

>
> ### Ref block format
>
> A ref block is written as:
>
> 'r'
> uint24( block_len )
> ref_record+
> uint32( restart_offset )+
> uint16( restart_count )
> padding?
>
> Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
> encodes the number of bytes in the block up to, but not including the
> optional `padding`.  This is almost always shorter than the file's
> `block_size`.  In the first ref block, `block_len` includes 24 bytes
> for the file header.

As a block cannot be longer than 16MB, allocating uint32 to a
restart offset may be a bit overkill.  I do not know if it is worth
attempting to pack 1/3 more restart entries in a block by using
uint24, though.

> The end of the block may be filled with `padding` NUL bytes to fill
> out the block to the common `block_size` as specified in the file
> header.  Padding may be necessary to ensure the following block starts
> at a block alignment, and does not spill into the tail of this block.
> Padding may be omitted if the block is the last block of the file, and
> there is no index block.  This allows reftable to efficiently scale
> down to a small number of refs.

We may want to phrase it in a way that makes it more clear that
padding, if present, must be filled with NUL bytes, not arbitrary
garbage.  Your version may be clear enough already.  I dunno.

>  ref record
>
> A `ref_record` describes a single reference, storing both the name and
> its value(s). Records are formatted as:
>
> varint( prefix_length )

Just like we saw that "uintX are network byte order" upfront, it may
be easier to give the definition, or at least an outline, of varint()
near there.

> varint( (suffix_length << 3) | value_type )
> suffix
> value?
>
> The `prefix_length` field specifies how many leading bytes of the
> prior reference record's name should be copied to obtain this
> reference's name.  This must be 0 for the first reference in any
> block, and also must be 0 for any `ref_record` whose offset is listed
> in the `restart_offset` table at the end of the block.
>
> Recovering a reference name from any `ref_record` is a simple concat:
>
> this_name = prior_name[0..prefix_length] + suffix
>
> The `suffix_length` value provides the number of bytes to copy from
> `suffix` to complete the reference name.
>
> The `value` follows.  Its format is determined by `value_type`, one of
> the following:
>
> - `0x0`: deletion; no value data (see transactions, below)
> - `0x1`: one 20-byte object id; value of the ref
> - `0x2`: two 20-byte object ids; value of the ref, peeled target
> - `0x3`: symref and text: `varint( text_len ) text`
> - `0x4`: index record (see below)
> - `0x5`: log record (see below)
>
> Symbolic references use `0x3` with a `text` string starting with `"ref: "`,
> followed by the complete name of the reference target.  No
> compression is applied to the target name.  Other types of contents
> that are also reference like, such as `FETCH_HEAD` and `MERGE_HEAD`,
> may also be stored using type `0x3`.
>
> Types `0x6..0x7` are reserved for future use.

I wondered if we regret the apparent limited extensibility later,
but giving 4 bits to value-type would limit suffix length that can
be represented by a single varint() only to 7, while the format
described would give us 

Re: reftable [v4]: new ref storage format

2017-07-31 Thread Stefan Beller
On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce  wrote:
> 4th iteration of the reftable storage format.
>
> You can read a rendered version of this here:
> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md
>
> Significant changes from v3:
> - Incorporated Michael Haggerty's update_index concept for reflog.
> - Explicitly documented log-only tables.

I have read through v4 and I was missing the rationale
to why this is a good idea, digging up the discussion for
v3 seems to indicate that reflogs and refs themselves have
different usage patterns such that different compaction patterns
are desired, hence we need to have different files for them.

> ### Ref block format
>
> A ref block is written as:
>
> 'r'
> uint24( block_len )
> ref_record+
> uint32( restart_offset )+

As the block_size is encoded in uint24 (and so is block_len),
the restart offsets could be made uint24 as well, though that may
have alignment issues, such that reading may be slower.
But as the ref_records may produce unaligned 32 bit ints
already, I am not worried about that.

> uint16( restart_count )

When looking for 16/32/64 bit hard coded ints I came across
this one once again. How much more expensive is reading
a varint? As the block_len points at the restart_count, we cannot
replace it with a varint easily, but we could use a byte-reversed
varint instead. If we do this step, all restart offsets could also be
(reverse) varints?


Re: reftable [v4]: new ref storage format

2017-07-31 Thread Dave Borowitz
On Sun, Jul 30, 2017 at 11:51 PM, Shawn Pearce  wrote:
> - Near constant time verification a SHA-1 is referred to by at least
>   one reference (for allow-tip-sha1-in-want).

I think I understated the importance of this when I originally brought
up allow-tip-sha1-in-want. This is an important optimization for *any*
HTTP server, even without allow-tip-sha1-in-want, in order to validate
the SHA-1s sent in the upload-pack request, which doesn't share memory
state with the /info/refs request processing.


Re: reftable [v4]: new ref storage format

2017-07-30 Thread Shawn Pearce
4th iteration of the reftable storage format.

You can read a rendered version of this here:
https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md

Significant changes from v3:
- Incorporated Michael Haggerty's update_index concept for reflog.
- Explicitly documented log-only tables.

- Magic changed to 'REFT'
- Symlinks now use type 0x3 with "ref: " prefix.
- Ref-like files (FETCH_HEAD, MERGE_HEAD) also use type 0x3.
- restart_count simplified, obj block_count simplified


# reftable

[TOC]

## Overview

### Problem statement

Some repositories contain a lot of references (e.g.  android at 866k,
rails at 31k).  The existing packed-refs format takes up a lot of
space (e.g.  62M), and does not scale with additional references.
Lookup of a single reference requires linearly scanning the file.

Atomic pushes modifying multiple references require copying the
entire packed-refs file, which can be a considerable amount of data
moved (e.g. 62M in, 62M out) for even small transactions (2 refs
modified).

Repositories with many loose references occupy a large number of disk
blocks from the local file system, as each reference is its own file
storing 41 bytes (and another file for the corresponding reflog).
This negatively affects the number of inodes available when a large
number of repositories are stored on the same filesystem.  Readers can
be penalized due to the larger number of syscalls required to traverse
and read the `$GIT_DIR/refs` directory.

### Objectives

- Near constant time lookup for any single reference, even when the
  repository is cold and not in process or kernel cache.
- Near constant time verification a SHA-1 is referred to by at least
  one reference (for allow-tip-sha1-in-want).
- Efficient lookup of an entire namespace, such as `refs/tags/`.
- Support atomic push with `O(size_of_update)` operations.
- Combine reflog storage with ref storage for small transactions.
- Separate reflog storage for base refs and historical logs.

### Description

A reftable file is a portable binary file format customized for
reference storage. References are sorted, enabling linear scans,
binary search lookup, and range scans.

Storage in the file is organized into blocks.  Prefix compression
is used within a single block to reduce disk space.  Block size is
tunable by the writer.
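
For example (an illustration, not part of the measurements below): if
`refs/heads/maint` is immediately followed by `refs/heads/master`
within a block, the second record can encode prefix_length = 13
(reusing `refs/heads/ma` from the prior name) and store only the
4-byte suffix `ster`, instead of repeating the full 17-byte name.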

### Performance

Space used, packed-refs vs. reftable:

repository | packed-refs | reftable | % original | avg ref  | avg obj
-----------|------------:|---------:|-----------:|---------:|--------:
android|  62.2 M |   34.4 M | 55.2%  | 33 bytes | 8 bytes
rails  |   1.8 M |1.1 M | 57.7%  | 29 bytes | 6 bytes
git|  78.7 K |   44.0 K | 60.0%  | 50 bytes | 6 bytes
git (heads)|   332 b |276 b | 83.1%  | 34 bytes | 0 bytes

Scan (read 866k refs), by reference name lookup (single ref from 866k
refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):

format      | cache | scan    | by name        | by SHA-1
------------|-------|--------:|---------------:|---------------:
packed-refs | cold  |  402 ms | 409,660.1 usec | 412,535.8 usec
packed-refs | hot   | |   6,844.6 usec |  20,110.1 usec
reftable| cold  |  112 ms |  33.9 usec | 323.2 usec
reftable| hot   | |  20.2 usec | 320.8 usec

Space used for 149,932 log entries for 43,061 refs,
reflog vs. reftable:

format        | size  | avg entry
--------------|------:|-----------:
$GIT_DIR/logs | 173 M | 1209 bytes
reftable  |   5 M |   37 bytes

## Details

### Peeling

References in a reftable are always peeled.

### Reference name encoding

Reference names should be encoded with UTF-8.

### Network byte order

All multi-byte, fixed width fields are in network byte order.

### Ordering

Blocks are lexicographically ordered by their first reference.

### Directory/file conflicts

The reftable format accepts both `refs/heads/foo` and
`refs/heads/foo/bar` as distinct references.

This property is useful for retaining log records in reftable, but may
confuse versions of Git using `$GIT_DIR/refs` directory tree to
maintain references.  Users of reftable may choose to continue to
reject `foo` and `foo/bar` type conflicts to prevent problems for
peers.

## File format

### Structure

A reftable file has the following high-level structure:

first_block {
  header
  first_ref_block
}
ref_blocks*
ref_index?
obj_blocks*
obj_index?
log_blocks*
log_index?
footer

A log-only file omits the `ref_blocks`, `ref_index`, `obj_blocks` and
`obj_index` sections, containing only the file header and log blocks:

first_block {
  header
}
log_blocks*
log_index?
footer

In a log-only file the first log block immediately follows the file
header, without padding to block alignment.
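
As a rough illustration of how a reader can tell the two layouts apart
(assuming the 4-byte `REFT` magic sits at the start of the 24-byte
header, which this excerpt does not spell out, and that ref blocks
begin with `'r'`):

    // Illustrative check only, not the reference implementation.
    static boolean startsWithRefBlock(byte[] file) {
      if (file.length < 25
          || file[0] != 'R' || file[1] != 'E'
          || file[2] != 'F' || file[3] != 'T')
        throw new IllegalArgumentException("not a reftable file");
      // In a log-only file the byte at offset 24 is a log block's type
      // instead of 'r'.
      return file[24] == 'r';
    }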

### Block size

The `block_size` is arbitrarily determined by the writer, and does not
have to be a power of 2.  The block size