Re: reftable: new ref storage format

2017-07-23 Thread Shawn Pearce
On Sun, Jul 23, 2017 at 3:56 PM, Shawn Pearce  wrote:
> On Mon, Jul 17, 2017 at 6:43 PM, Michael Haggerty  
> wrote:
>> On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce  wrote:
>>> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty  
>>> wrote:
>
>> * What would you think about being extravagant and making the
>> value_type a full byte? It would make the format a tiny bit easier to
>> work with, and would leave room for future enhancements (e.g.,
>> pseudorefs, peeled symrefs, support for the successors of SHA-1s)
>> without having to change the file format dramatically.
>
> I reran my 866k file with full byte value_type. It pushes up the
> average bytes per ref from 33 to 34, but the overall file size is
still 28M (with a 64k block size). I think it's reasonable to expand this
> to the full byte as you suggest.

FYI, I went back on this in the v3 draft I posted on Jul 22 in
https://public-inbox.org/git/CAJo=hJvxWg2J-yRiCK3szux=eym2thjt0kwo-sffooc1rkx...@mail.gmail.com/

I expanded value_type from 2 bits to 3 bits, but kept it as a bit
field in a varint. I just couldn't justify the additional byte per ref
in these large files. The prefix compression works well enough that
many refs are still able to use only a single byte for the
suffix_length << 3 | value_type varint, keeping the average at 33
bytes per ref.

The reftable format uses values 0-3, leaving 4-7 available. I reserved
4 for an arbitrary payload like MERGE_HEAD type files.
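
The single-byte claim above is easy to check. A minimal sketch, assuming
an LEB128-style varint (the draft's exact varint encoding may differ,
and the helper names here are illustrative, not from the spec):

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128-style varint: 7 data bits per byte, MSB marks continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(buf: bytes) -> tuple:
    """Return (value, number of bytes consumed)."""
    n = shift = i = 0
    while True:
        b = buf[i]
        i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i
        shift += 7

def encode_suffix_and_type(suffix_length: int, value_type: int) -> bytes:
    # value_type is a 3-bit field: values 0-3 in use, 4 reserved, 5-7 free.
    assert 0 <= value_type <= 7
    return encode_varint((suffix_length << 3) | value_type)
```

Any suffix of 15 bytes or less keeps the combined field under 128, so it
encodes in a single byte regardless of the value_type.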


Re: reftable: new ref storage format

2017-07-23 Thread Shawn Pearce
+git@vger.kernel.org. I originally sent the below reply privately by mistake.

On Mon, Jul 17, 2017 at 6:43 PM, Michael Haggerty  wrote:
> On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce  wrote:
>> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty  
>> wrote:
>
> On second thought, the idea of having HEAD (or maybe all pseudorefs)
> in the same system would open a few interesting possibilities that
> derive from having a global, atomic view of all references:
>
> 1. We could store backlinks from references to the symbolic references
> that refer to them. This would allow us to update the reflogs for
> symbolic refs properly. (Currently, there is special-case code to
> update the reflogs for HEAD when the reference that it points at is
> modified, but not for other symrefs.)

This is a good idea, but it makes for some difficult transition code. We
would have to keep the special case for HEAD, while other symrefs would
be logged only when stored in a reftable.

> 2. We could store "peeled" versions of symbolic refs. These would have
> to be updated whenever the pointed-at reference is updated, but that
> would have two nice advantages: HEAD would usually be resolvable based
> on the top reftable in the stack, and it would be resolvable in one
> step (without having to follow the symref explicitly).

Great observation. I wish I had seen that sooner. It's a pain in the
neck to resolve symrefs, and it has caused us a few bugs in JGit on our
current non-standard storage. It depends on the back pointer being
present and accurate to ensure an update of master also updates the
cached HEAD.

I'll have to mull on these a bit. I'm not folding them into my
documentation and implementation just yet.


[...]
> I'm still not quite resigned to non-Google users wanting to use blocks
> as large as 64k, but (short of doing actual experiments, yuck!) I
> can't estimate whether it would make any detectable difference in the
> real world.

I think it is only likely to matter with NFS, and then it's a balancing
act of how much of that block you needed vs. did not. :)

> On the other end of the spectrum, I might mention that the
> shared-storage "network.git" repositories that we use at GitHub often
> have a colossal number of references (basically, the sum of the number
> of references in all of the forks in a "repository network", including
> some hidden references that users don't see). For example, one
> "network.git" repository has 56M references(!) Mercifully, we
> currently only have to access these repositories for batch jobs, but,
> given a better reference storage backend, that might change.

A larger block size right now has the advantage of a smaller index,
which could make a single ref lookup more efficient. Otherwise, the
block size doesn't have a big impact on streaming through many
references.


>>> 2. The stacking of multiple reftable files together
[...]
>> At $DAY_JOB we can do this successfully with pack files, which are
>> larger and more costly to combine than reftable. I think we can get
>> reftable to support a reasonable stack depth.
>
> Are you saying that you merge subsets of packfiles without merging all
> of them? Does this work together with bitmaps, or do you only have
> bitmaps for the biggest packfile?
>
> We've thought about merging packfiles in that way, but don't want to
> give up the benefits of bitmaps.

Yes. We compact smaller pack files together into a larger pack file,
and try to keep a repository at:

 - 2 compacted packs, each <20 MiB
 - 1 base pack + bitmap

We issue a daily GC for any repository that isn't just 1 base pack.
But during a business day this compacting process lets us handle most
read traffic quite well, despite the bitmaps being incomplete.


>>> I haven't reviewed your proposal for storing reflogs in reftables in
[...]
>
> Those sizes don't sound that scary. Do your reflogs include
> significant information in the log messages, or are they all "push"
> "push" "push"? We record quite a bit of information in our audit_log
> entries (our equivalent of reflogs), so I would expect ours to
> compress much less well.

These were pretty sparse in the comment field, with frequent reuse of
the same message. So it may not be representative of what you are storing.

> We also tend to use our audit_logs to see what was happening in a
> repository; e.g., around the time that a problem occurred. So for us
> it is useful that the entries are in chronological order across
> references, as opposed to having the entries for each reference
> grouped together. We might be the oddballs here though, and in fact it
> is possible that this would be an argument for us to stick to our
> audit_log scheme rather than use reflogs stored in reftables.

I think of reflogs as being about a single ref, not the whole repository. So
I'm inclined to say the reftable storage of them should be by ref,
then time. Anyone who wants a repository view must either scan the
entire log segment, or 

Re: reftable: new ref storage format

2017-07-18 Thread Junio C Hamano
Michael Haggerty  writes:

> On second thought, the idea of having HEAD (or maybe all pseudorefs)
> in the same system would open a few interesting possibilities that
> derive from having a global, atomic view of all references:
>
> 1. We could store backlinks from references to the symbolic references
> that refer to them. This would allow us to update the reflogs for
> symbolic refs properly. (Currently, there is special-case code to
> update the reflogs for HEAD when the reference that it points at is
> modified, but not for other symrefs.)
>
> 2. We could store "peeled" versions of symbolic refs. These would have
> to be updated whenever the pointed-at reference is updated, but that
> would have two nice advantages: HEAD would usually be resolvable based
> on the top reftable in the stack, and it would be resolvable in one
> step (without having to follow the symref explicitly).

Interesting.  I think FETCH_HEAD is the only thing that would not
mesh well with the "one ref records one object name, or refers to
another ref" paradigm, and I think it is OK to leave it that way,
even in the new pluggable ref backend world order.

It still bothers me that the way refs.c special-cases what you call
pseudo refs seems somewhat inconsistent, which I found by accident
the other day while running t1405 and reported in a separate thread.
When we start adding a reftable backend, I suspect we'd need to dig
further, but if I recall the symptom correctly, writing them out is
still passed to the main (i.e. files) backend, while reading them is
done directly in refs.c layer without consulting any backend, and
that made it impossible to optionally tweak the filename used to
store refs in the files backend.






Re: reftable: new ref storage format

2017-07-17 Thread Michael Haggerty
On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce  wrote:
> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty  
> wrote:
>> * Do you propose to store *all* references (i.e., including the
>> references that we call pseudorefs, like `HEAD`, `FETCH_HEAD`, etc) in
>> reftables, or only the references under `refs/`? If the former, then
>> you need to consider that some pseudorefs contain additional
>> information besides the SHA-1 or linked-to reference. If the latter,
>> then you could get some additional compression by making the `refs/`
>> prefix implicit rather than storing it in the "restart" records
>> explicitly.
>
> Great question I didn't have an answer for. I was planning to store
> HEAD, but not FETCH_HEAD/MERGE_HEAD/etc. Mostly because our storage
> system at $DAY_JOB doesn't have a place to store HEAD itself; it's
> buried down in the reference system. So reftable has to do the job.
>
> The file format for reftable describes a type 0x3 which is just a
> length delimited byte string. Provided that the data fits into a
> single block, reftable can store these larger files with their
> auxiliary data.
>
> I'm open to the idea of HEAD and friends being outside of reftable in
> a git-core accessed repository, but the backend storage I want to run
> reftable on would likely need to store HEAD inside reftable.
>
>>   Personally, I don't think it makes sense to store pseudorefs in
>> reftables. HEAD, though it doesn't include any supplemental
>> information, is read so frequently that it seems like a bad idea to
>> have to go through the lookup process rather than storing it in a
>> separate flat file. Moreover, HEAD is written very *infrequently*, so
>> (absent special treatment) it would tend to sink deep in the reftable
>> stack, making reads even more expensive.
>
> That is a very fair argument for keeping HEAD outside.
>
> A counter argument is HEAD is written very frequently by following its
> indirect pointer to the branch (e.g. refs/heads/master). HEAD is
> consequently reflogged very frequently. reftable stores the logs
> inside the table shards. HEAD could be floated onto the top on every
> write to accompany its log record.

On second thought, the idea of having HEAD (or maybe all pseudorefs)
in the same system would open a few interesting possibilities that
derive from having a global, atomic view of all references:

1. We could store backlinks from references to the symbolic references
that refer to them. This would allow us to update the reflogs for
symbolic refs properly. (Currently, there is special-case code to
update the reflogs for HEAD when the reference that it points at is
modified, but not for other symrefs.)

2. We could store "peeled" versions of symbolic refs. These would have
to be updated whenever the pointed-at reference is updated, but that
would have two nice advantages: HEAD would usually be resolvable based
on the top reftable in the stack, and it would be resolvable in one
step (without having to follow the symref explicitly).

> [...]
>> * The tuning parameter number_of_restarts currently trades off space
>> (for the full refnames and the restart_offsets) against the need to
>> read and parse more ref_records to get the full refnames. ISTM that
>> this tradeoff could be made less painful by defining a blockwide
>> prefix that is omitted from the refnames as used in the restarts. So
>> the full refname would change from
>>
>>   this_name = prior_name[0..prefix_length] + suffix
>>
>>   to
>>
>>   this_name = block_prefix + prior_name[0..prefix_length] + suffix
>>
>>   I would expect this to allow more frequent restarts at lower space
>> cost.
>
> I've been on the fence about the value of this. It makes the search
> with restarts more difficult to implement, but does allow shrinking a
> handful of very popular prefixes like "refs/" and "refs/pulls/" in
> some blocks.
>
> An older format of reftable used only a block_prefix, and could not
> get nearly as good compression as too many blocks contained references
> with different prefixes.

It's clear that using *only* a block_prefix wouldn't yield as good
compression as your proposed scheme. And it feels plausible that when
using 64k blocks, a block_prefix wouldn't help much.

I'm still not quite resigned to non-Google users wanting to use blocks
as large as 64k, but (short of doing actual experiments, yuck!) I
can't estimate whether it would make any detectable difference in the
real world.

On the other end of the spectrum, I might mention that the
shared-storage "network.git" repositories that we use at GitHub often
have a colossal number of references (basically, the sum of the number
of references in all of the forks in a "repository network", including
some hidden references that users don't see). For example, one
"network.git" repository has 56M references(!) Mercifully, we
currently only have to access these repositories for batch jobs, but,
given a better reference storage backend, that might change.

Re: reftable: new ref storage format

2017-07-16 Thread Shawn Pearce
On Sun, Jul 16, 2017 at 2:13 PM, Dave Borowitz  wrote:
> On Sun, Jul 16, 2017 at 3:43 PM, Shawn Pearce  wrote:
>> True... but... in my "android" example repository we have 866,456 live
>> refs. A block size of 64k needs only 443 blocks, and a 12k index, to
>> get the file to compress to 28M (vs. 62M packed-refs).
>>
>> Index records are averaging 28 bytes per block. That gives us room for
> about 2,340 blocks, or 4,574,700 refs before the index block exceeds
>> 64k.
>
> That's only a 5x increase over the current number of refs in this
> android repo. I would not be so sure this repo doesn't grow another 5x
> in the next few years. Especially as the other optimizations for
> working with large repos start to be applied, so it won't be
> prohibitively painful to work with such a repo.
>
> Are we ok with increasing the block size when this eventually happens?
> (At least I think that's what we would have to do, I haven't been
> following closely the discussion on scaling limits.)

I think I'd try letting the index grow to 4 blocks (256k) before I
considered increasing the block size. Remember pack idx files are much
larger, and are loaded wholesale into memory by JGit. A ref idx at
256k might not be problematic.
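
A quick back-of-the-envelope using the numbers quoted in this thread
(28 bytes per index record, one record per ref block, ~1955 refs per
64k block in the "android" repository); the helper name is illustrative:

```python
# Numbers from the thread: 866,456 refs pack into 443 blocks
# (~1955 refs/block), and index records average ~28 bytes each.
INDEX_RECORD_BYTES = 28
REFS_PER_BLOCK = 866_456 // 443            # ~1955

def refs_supported(index_bytes: int) -> int:
    """Refs addressable before the index outgrows the given size."""
    blocks = index_bytes // INDEX_RECORD_BYTES
    return blocks * REFS_PER_BLOCK

print(refs_supported(64 * 1024))           # a single 64k index block
print(refs_supported(4 * 64 * 1024))       # a 4-block (256k) index
```

Letting the index grow to 256k roughly quadruples capacity, to about
18M refs, before the block size would need to increase.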


Re: reftable: new ref storage format

2017-07-16 Thread Dave Borowitz
On Sun, Jul 16, 2017 at 3:43 PM, Shawn Pearce  wrote:
> True... but... in my "android" example repository we have 866,456 live
> refs. A block size of 64k needs only 443 blocks, and a 12k index, to
> get the file to compress to 28M (vs. 62M packed-refs).
>
> Index records are averaging 28 bytes per block. That gives us room for
about 2,340 blocks, or 4,574,700 refs before the index block exceeds
> 64k.

That's only a 5x increase over the current number of refs in this
android repo. I would not be so sure this repo doesn't grow another 5x
in the next few years. Especially as the other optimizations for
working with large repos start to be applied, so it won't be
prohibitively painful to work with such a repo.

Are we ok with increasing the block size when this eventually happens?
(At least I think that's what we would have to do, I haven't been
following closely the discussion on scaling limits.)


Re: reftable: new ref storage format

2017-07-16 Thread Shawn Pearce
On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce  wrote:
> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty  
> wrote:
>
>> * The tuning parameter number_of_restarts currently trades off space
>> (for the full refnames and the restart_offsets) against the need to
>> read and parse more ref_records to get the full refnames. ISTM that
>> this tradeoff could be made less painful by defining a blockwide
>> prefix that is omitted from the refnames as used in the restarts. So
>> the full refname would change from
>>
>>   this_name = prior_name[0..prefix_length] + suffix
>>
>>   to
>>
>>   this_name = block_prefix + prior_name[0..prefix_length] + suffix
>>
>>   I would expect this to allow more frequent restarts at lower space
>> cost.
>
> I've been on the fence about the value of this. It makes the search
> with restarts more difficult to implement, but does allow shrinking a
> handful of very popular prefixes like "refs/" and "refs/pulls/" in
> some blocks.
>
> An older format of reftable used only a block_prefix, and could not
> get nearly as good compression as too many blocks contained references
> with different prefixes.


I ran an experiment on my 866k ref data set. Using a block_prefix gets
less compression, and doesn't improve packing in the file. Given the
additional code complexity, it really isn't worth it:

format        |  size  |  blocks  |  avg ref/blk
--------------|--------|----------|-------------
original      |  28 M  |   443    |  1955
block_prefix  |  29 M  |   464    |  1867
:-(


Re: reftable: new ref storage format

2017-07-16 Thread Shawn Pearce
On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty  wrote:
> Thanks for your reftable proposal.

Thanks for your time reading the proposal. I really was looking
forward to your insights on this project.

> It would solve a lot of scalability
> problems that we currently have, and do it in a way that is
> implementable in both C and Java, which is very nice.
>
> There are two mostly orthogonal components to your proposal:
>
> 1. What does a single reftable file look like?
> 2. How can multiple reftable files be used together to avoid having to
> rewrite data more than necessary?
>
> For example (just for the sake of argument), many of the goals could
> be achieved by stacking traditional packed-refs files together (i.e.,
> component 2 of your proposal),

Agreed. I actually started from stacked packed-refs as the format
proposal and iterated on that several times before starting to draft
reftable. I like that packed-refs is already a format, and it's
human-readable. It unfortunately still didn't solve enough other objectives,
which led me towards reftable.


> * Do you propose to store *all* references (i.e., including the
> references that we call pseudorefs, like `HEAD`, `FETCH_HEAD`, etc) in
> reftables, or only the references under `refs/`? If the former, then
> you need to consider that some pseudorefs contain additional
> information besides the SHA-1 or linked-to reference. If the latter,
> then you could get some additional compression by making the `refs/`
> prefix implicit rather than storing it in the "restart" records
> explicitly.

Great question I didn't have an answer for. I was planning to store
HEAD, but not FETCH_HEAD/MERGE_HEAD/etc. Mostly because our storage
system at $DAY_JOB doesn't have a place to store HEAD itself; it's
buried down in the reference system. So reftable has to do the job.

The file format for reftable describes a type 0x3 which is just a
length delimited byte string. Provided that the data fits into a
single block, reftable can store these larger files with their
auxiliary data.

I'm open to the idea of HEAD and friends being outside of reftable in
a git-core accessed repository, but the backend storage I want to run
reftable on would likely need to store HEAD inside reftable.


>   Personally, I don't think it makes sense to store pseudorefs in
> reftables. HEAD, though it doesn't include any supplemental
> information, is read so frequently that it seems like a bad idea to
> have to go through the lookup process rather than storing it in a
> separate flat file. Moreover, HEAD is written very *infrequently*, so
> (absent special treatment) it would tend to sink deep in the reftable
> stack, making reads even more expensive.

That is a very fair argument for keeping HEAD outside.

A counter argument is HEAD is written very frequently by following its
indirect pointer to the branch (e.g. refs/heads/master). HEAD is
consequently reflogged very frequently. reftable stores the logs
inside the table shards. HEAD could be floated onto the top on every
write to accompany its log record.


> * You have obviously designed your proposal to be friendly to whatever
> non-local storage system you are using at Google. It would probably
> help us understand your design tradeoffs better if you could tell us a
> little about that storage system.

Remote network attached disk. Random 64k seeks are O(40ms).
Smaller seeks cost about the same as 64k.
Larger seeks start to see the block size increase latency.

Caching is the responsibility of the application (e.g. JGit).

Tail-block packing is available for small files (ReiserFS-ish, but it's
not ReiserFS). A 4k minimum file size occupies the entire 4k, and has
to transfer between the CPU and the disk. A 300 byte file occupies 300
bytes. The more the file scales down with its content, the better (for
this system).


> So let's examine the components one after the other...
>
> 1. The structure of a single reftable file
>
> * You use int32 and int24 values in a number of places. Couldn't these
> be uint32 and uint24?

Yes. I'll clarify that in the document.


> * You use int32 values in a number of places where a smaller size
> would suffice. For example, the block size is limited to 24 bits, so
> surely restart_offset, record_end_offset, and number_of_restarts
> needn't be larger than 24 bits?
>
>OK, actually the *index* block size is not limited to 24 bits, so
> it's not a no-brainer.

Right, it's the index block that really pushes the restart_offsets to
32 bits. And I wanted to leverage code between the multiple block
formats as much as possible, as the structure is similar.


> * I'm a little bit concerned that the structure is fixed to be a one-
> or two-level indexing scheme; i.e., the (optional) index block plus
> the restart index table within a block. This forces (as I understand
> it) the block size to be chosen based on the overall file size to
> prevent the index block from becoming too large, whereas for a
> low-latency local filesystem implementation like SSDs it would
> probably be preferable to set the block size to agree with the page
> size to minimize reads and memory usage.

Re: reftable: new ref storage format

2017-07-16 Thread Michael Haggerty
Thanks for your reftable proposal. It would solve a lot of scalability
problems that we currently have, and do it in a way that is
implementable in both C and Java, which is very nice.

There are two mostly orthogonal components to your proposal:

1. What does a single reftable file look like?
2. How can multiple reftable files be used together to avoid having to
rewrite data more than necessary?

For example (just for the sake of argument), many of the goals could
be achieved by stacking traditional packed-refs files together (i.e.,
component 2 of your proposal), with the single extension that a
packed-refs file can assign some "nil" value to a reference to
indicate that the reference has been deleted. As long as references
are sought in packed-refs files using binary search, lookup and
iteration would both be O(n + lg N) (where N is the total number of
references and n is the number being iterated over), and updates would
(I think) be amortized O(n), where n is the number of references being
updated or deleted. Stacked packed-refs files would, of course, not
yield the compression benefits of your reftable proposal.
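
The stacked-lookup semantics described above can be sketched with
in-memory tables standing in for binary-searched packed-refs files (the
ref names and the None "nil" marker are illustrative):

```python
from typing import Optional

def lookup(stack: list, refname: str) -> Optional[str]:
    """stack[0] is the newest layer; newer layers shadow older ones."""
    for table in stack:
        if refname in table:
            return table[refname]   # None here means "deleted in this layer"
    return None                     # never written in any layer

newest = {"refs/heads/topic": None}               # "nil" deletion marker
base = {"refs/heads/master": "1234abcd",
        "refs/heads/topic": "feedface"}
print(lookup([newest, base], "refs/heads/master"))  # 1234abcd
print(lookup([newest, base], "refs/heads/topic"))   # None: shadowed by the delete
```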

Overall comments/questions:

* Do you propose to store *all* references (i.e., including the
references that we call pseudorefs, like `HEAD`, `FETCH_HEAD`, etc) in
reftables, or only the references under `refs/`? If the former, then
you need to consider that some pseudorefs contain additional
information besides the SHA-1 or linked-to reference. If the latter,
then you could get some additional compression by making the `refs/`
prefix implicit rather than storing it in the "restart" records
explicitly.

  Personally, I don't think it makes sense to store pseudorefs in
reftables. HEAD, though it doesn't include any supplemental
information, is read so frequently that it seems like a bad idea to
have to go through the lookup process rather than storing it in a
separate flat file. Moreover, HEAD is written very *infrequently*, so
(absent special treatment) it would tend to sink deep in the reftable
stack, making reads even more expensive.

* You have obviously designed your proposal to be friendly to whatever
non-local storage system you are using at Google. It would probably
help us understand your design tradeoffs better if you could tell us a
little about that storage system.

So let's examine the components one after the other...

1. The structure of a single reftable file

* You use int32 and int24 values in a number of places. Couldn't these
be uint32 and uint24?

* You use int32 values in a number of places where a smaller size
would suffice. For example, the block size is limited to 24 bits, so
surely restart_offset, record_end_offset, and number_of_restarts
needn't be larger than 24 bits?

   OK, actually the *index* block size is not limited to 24 bits, so
it's not a no-brainer.

* I'm a little bit concerned that the structure is fixed to be a one-
or two-level indexing scheme; i.e., the (optional) index block plus
the restart index table within a block. This forces (as I understand
it) the block size to be chosen based on the overall file size to
prevent the index block from becoming too large, whereas for a
low-latency local filesystem implementation like SSDs it would
probably be preferable to set the block size to agree with the page
size to minimize reads and memory usage.

  So I ask myself whether it might be worthwhile to allow deeper
levels of indexing. The main index block could divide the namespace
into as many segments as fit in one block, but if necessary point at a
second-level index block that further divides the namespace before
pointing at the ref block that ultimately contains the value. Those of
us using local SSDs might tune the system to use 4k block sizes
throughout, basically preferring to read three or even four disjoint
4k blocks rather than two disjoint 64k blocks.

* The tuning parameter number_of_restarts currently trades off space
(for the full refnames and the restart_offsets) against the need to
read and parse more ref_records to get the full refnames. ISTM that
this tradeoff could be made less painful by defining a blockwide
prefix that is omitted from the refnames as used in the restarts. So
the full refname would change from

  this_name = prior_name[0..prefix_length] + suffix

  to

  this_name = block_prefix + prior_name[0..prefix_length] + suffix

  I would expect this to allow more frequent restarts at lower space
cost. Combining this idea with the previous idea, non-top-level index
blocks could also have a block_prefix to make their restarts cheaper,
too.

  If we are willing to force all reads to consider the indexes, the
block_prefix could be taken from the index_record that points at it.
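
The reconstruction formula above is simple to state in code (an
illustrative decoder, not the draft's exact record layout):

```python
def reconstruct(block_prefix: str, prior_name: str,
                prefix_length: int, suffix: str) -> str:
    # this_name = block_prefix + prior_name[0..prefix_length] + suffix
    return block_prefix + prior_name[:prefix_length] + suffix

# With block_prefix "refs/heads/", the record for "maint" following
# "master" needs only prefix_length 2 and the 3-byte suffix "int":
print(reconstruct("refs/heads/", "master", 2, "int"))   # refs/heads/maint
```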

* Does the format need to insist on fixed-size blocks? The alternative
would be to allow index_records to point at arbitrary offsets in the
file. Admittedly, the file offsets would take more bits to store them
than the block_idx. But it would allow 

Re: reftable: new ref storage format

2017-07-16 Thread Johannes Sixt

Am 16.07.2017 um 12:03 schrieb Jeff King:
> On Sun, Jul 16, 2017 at 10:07:57AM +0200, Johannes Sixt wrote:
>
> > Am 14.07.2017 um 22:08 schrieb Jeff King:
> > > The implementation on this doesn't seem overly complex. My main concerns
> > > are what we're asking from the filesystem in terms of atomicity, and
> > > what possible races there are.
> >
> > One of the failure modes is that on Windows a file cannot be deleted while
> > it is open in any process. It can happen that a compacting updater wants to
> > remove a reftable file that is still open in a reader.
>
> Good point. I think the explicit pointers I mentioned are an improvement
> there, because a compacting updater _can_ leave the file in place if the
> delete fails (and later, another compaction can clean up cruft that was
> left).

Yes, I think so, too. The pointers make things so much simpler.

> I assume that's more or less how pack deletion works on Windows.

Correct.

-- Hannes


Re: reftable: new ref storage format

2017-07-16 Thread Jeff King
On Sun, Jul 16, 2017 at 10:07:57AM +0200, Johannes Sixt wrote:

> Am 14.07.2017 um 22:08 schrieb Jeff King:
> > The implementation on this doesn't seem overly complex. My main concerns
> > are what we're asking from the filesystem in terms of atomicity, and
> > what possible races there are.
> 
> One of the failure modes is that on Windows a file cannot be deleted while
> it is open in any process. It can happen that a compacting updater wants to
> remove a reftable file that is still open in a reader.

Good point. I think the explicit pointers I mentioned are an improvement
there, because a compacting updater _can_ leave the file in place if the
delete fails (and later, another compaction can clean up cruft that was
left).

I assume that's more or less how pack deletion works on Windows.

-Peff


Re: reftable: new ref storage format

2017-07-16 Thread Jeff King
On Sat, Jul 15, 2017 at 11:01:47PM -0700, Shawn Pearce wrote:

> > How deep would you anticipate stacks getting? Would it be feasible for
> > the tip to contain the names of the tables in the entire chain? If we're
> > talking about 20 (or even 32) bytes per name, you could still fit over a
> > hundred names in a 4K inode.
> 
> I think we'd want to keep the stacks under 32, which is a reasonable
> amount of space used in the header of each reftable. I don't have this
> yet in my updated document + implementation, but I'll look at trying
> to add it over the next couple of days. Your idea to hold the explicit
> list of the stack in each reftable makes for a very safe atomic reader
> view.

Great.  I was thinking about this a bit and have one more possible
tweak.

If you store the names of the dependent reftables in each table, then
you end up using storage quadratic in the size of the stack. Because
the bottom-most table has 0 pointers, the next one has 1, the one after
that has 2, and so on, until the nth one has n.

Now we're talking about n=32 here, so that's probably OK.
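
Concretely, a full n-deep stack stores 0 + 1 + ... + (n-1) =
n*(n-1)/2 ancestor names in total, which stays small at the depths
under discussion:

```python
def total_pointer_bytes(depth: int, name_bytes: int = 20) -> int:
    # Table k stores k ancestor names; summing over the whole stack
    # gives depth * (depth - 1) / 2 names.
    return depth * (depth - 1) // 2 * name_bytes

print(total_pointer_bytes(32))   # 496 names * 20 bytes = 9920 bytes
```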

But one variant is that the reftables _don't_ know about their
ancestors. Instead, the list of reftables is kept in a top-level pointer
file, and it's that pointer file which is rewritten on update. I.e., a
write is something like:
 
   1. Take reftable.lock

   2. Write reftables/1234abcd to represent your update.

   3. Copy the old reftable to reftable.lock, then append "1234abcd".

   4. Atomic rename into place.

And the reader is just:

  1. Open reftable, read the list of tables.

  2. In parallel, open/fetch each of the tables and find your starting
 pointer for iteration/lookup.

  3. Do a list-merge on the open tables.
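
The write and read sides above can be sketched as follows (paths and
helper names are hypothetical; error handling and lock cleanup on
failure are elided):

```python
import os

def update(git_dir: str, new_table: str, table_bytes: bytes) -> None:
    lock = os.path.join(git_dir, "reftable.lock")
    final = os.path.join(git_dir, "reftable")
    # 1. Take reftable.lock (O_EXCL fails if another writer holds it).
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o666)
    try:
        # 2. Write reftables/<name> to represent the update.
        with open(os.path.join(git_dir, "reftables", new_table), "wb") as f:
            f.write(table_bytes)
        # 3. Copy the old list into the lock file, appending the new name.
        tables = []
        if os.path.exists(final):
            with open(final) as f:
                tables = f.read().split()
        tables.append(new_table)
        os.write(fd, ("\n".join(tables) + "\n").encode())
    finally:
        os.close(fd)
    # 4. Atomic rename into place.
    os.replace(lock, final)

def read_tables(git_dir: str) -> list:
    # Reader: one open() of the pointer file yields an atomic view.
    with open(os.path.join(git_dir, "reftable")) as f:
        return f.read().split()
```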

The one thing you lose is that "unreachable" reftables no longer form a
meaningful hierarchy. With the pointers inside the reftables themselves,
if your "reftable" file got corrupted, you could find the dangling table
at the apex of the graph and have a good guess at the ref state.
Without, you just have a jumble of states and you don't know which takes
precedence (though you could probably make a good guess from mtimes).

> I added log support to the reftable format. I updated [1] to reflect
> log blocks at the end of the file. I ran a year's worth of log
> records, 149,932 log entries on 43,061 refs to test:

Cool. I'll be on vacation for the next week, so apologies if I don't
keep the discussion going. But I'm very excited about the overall
direction. :)

-Peff


Re: reftable: new ref storage format

2017-07-16 Thread Johannes Sixt

Am 14.07.2017 um 22:08 schrieb Jeff King:
> The implementation on this doesn't seem overly complex. My main concerns
> are what we're asking from the filesystem in terms of atomicity, and
> what possible races there are.

One of the failure modes is that on Windows a file cannot be deleted
while it is open in any process. It can happen that a compacting updater
wants to remove a reftable file that is still open in a reader.

-- Hannes


Re: reftable: new ref storage format

2017-07-16 Thread Shawn Pearce
On Fri, Jul 14, 2017 at 1:08 PM, Jeff King  wrote:
> On Thu, Jul 13, 2017 at 05:11:52PM -0700, Shawn Pearce wrote:
>
> I think the "stack" implementation is what makes me most uncomfortable
> with this proposal. Atomicity with filesystem operations and especially
> readdir() is one of the things I think is most flaky about the current
> system. Here's an idea for an alternative implementation.
>
>   1. Write out reftables to files named after the hash of their content
>  (e.g., $GIT_DIR/reftables/1234abcd...).
>
>   2. The first block of the each reftable has a backpointer to the
>  previous table.
>
>   3. There's a well-known name (e.g., $GIT_DIR/reftable) that represents
>  the current state. We update it with the usual .lock/rename dance.
>
> That gives readers an easy atomic view; once they've called open() on
> "reftable", they follow back-pointers instead of computing the names
> (n-1, n-2, etc). They may still find that a compaction has removed a
> file they need, but:
>
>   - they only have to restart due to an actual compaction. They'll never
> be tricked by a regular update.
>
>   - you can compact without immediately deleting the old reftables. So
> you might compact, and then delete the reftables N seconds later. Any
> reader which completes the read within N seconds wouldn't have to
> restart.
>
> I think I can anticipate your answer, though. If you have a system where
> the latency to open and read a file is high, then you've just serialized
> the latencies as you walk the chain. Whereas with predictable names, you
> can pre-fetch refname.i through refname.j in parallel.
>
> How deep would you anticipate stacks getting? Would it be feasible for
> the tip to contain the names of the tables in the entire chain? If we're
> talking about 20 (or even 32) bytes per name, you could still fit over a
> hundred names in a 4K inode.

I think we'd want to keep the stacks under 32, which is a reasonable
amount of space used in the header of each reftable. I don't have this
yet in my updated document + implementation, but I'll look at trying
to add it over the next couple of days. Your idea to hold the explicit
list of the stack in each reftable makes for a very safe atomic reader
view.


> It doesn't escape me that I'm basically reinventing RefTree here, with
> reftables instead of tree objects. But I think breaking away from using
> real Git objects opens up a lot of efficiency tricks (like the prefix
> compression, and the parallel-fetch thing above). And it removes a lot
> of the gc complexity.

Yes, I agree. RefTree also has trouble scaling due to the flat tree
object format. It depends on the user to sensibly break up the
reference space with '/' characters sprinkled about. This reftable
proposal does not suffer from that limitation; a user can use any
valid ref name structure.


> So I'm worried about live-locking with a regular updater, not even a
> compacting one.

Ok. I think your idea of tracking an explicit list of the stack in the
top of every reftable solves this in a very neat way, so I'll look to
switch to that.


>> block  restart  size  lookup
>> 4k     16       29M   90.2 usec
>> 8k     16       29M   76.4 usec
>
> Thanks for these numbers. I was really thinking that blocks would be on
> the order of 4K (where you can see that the restarts help very little).
> For local disk that's a pretty reasonable size. For high-latency fetches
> to a specialized database, maybe not.
...
>> > One thing I didn't see is reflogs. They don't strictly have to be part
>> > of a ref-storage solution. But they do still consume at least one inode
>> > per ref in the current system. If you reflog everything, which in my
>> > opinion you should. Having an audit trail of ref updates is very useful
>> > for debugging (both bugs in Git, and trying to figure out what went
>> > wrong when a user goes berserk with "git push --prune").
>>
>> Yea, I glossed over that and ignored them. Mostly because one system
>> where I want to use reftable already has a separate database handling
>> the reflogs. In another (Gerrit Code Review), we disable reflogs for
>> the insane refs/changes/ namespace, as nearly every reference is
>> created once, and never modified.
>
> Even for created-once refs, I've found an audit trail of created when,
> by whom, using what program to be quite valuable.

Dave Borowitz agrees with you, Gerrit Code Review should be recording
reflog data, even for create-once refs. It's a limitation of the
current $GIT_DIR/logs format that has forced it to be disabled. I'd
really like something like reftable, so we can record reflogs.


>> One could abuse the reftable format to store a reflog. If the refname
>> field is just a sequence of bytes, one could store keys like refname +
>> '\0' + timestamp, and reuse the symbolic ref value format to store the
>> old/new/message, as its just a length delimited string.
>
> Gross, but it could work. I actually think David's LMDB 
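The key encoding quoted above (refname + '\0' + timestamp) can be sketched like this; the fixed 8-byte, inverted timestamp is an assumption of the sketch, chosen so the newest entry for a ref sorts first:

```python
MAX_TS = (1 << 64) - 1  # width of the timestamp field; 8 bytes is an assumption

def reflog_key(refname: bytes, timestamp: int) -> bytes:
    """Sortable reflog key: refname, a NUL, then the timestamp stored
    inverted so newer entries sort before older ones within a ref."""
    inverted = MAX_TS - timestamp
    return refname + b"\x00" + inverted.to_bytes(8, "big")

def parse_reflog_key(key: bytes):
    """Split a key back into (refname, timestamp)."""
    assert key[-9] == 0  # the NUL separator before the fixed-width timestamp
    return key[:-9], MAX_TS - int.from_bytes(key[-8:], "big")
```

The value could then reuse the length-delimited symbolic-ref format for old/new/message, as the quoted text suggests.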

Re: reftable: new ref storage format

2017-07-14 Thread Jeff King
On Thu, Jul 13, 2017 at 05:27:44PM -0700, Shawn Pearce wrote:

> > We _could_ consider gzipping individual blocks of
> > a reftable (or any structure that allows you to search to a
> > constant-sized block and do a linear search from there). But given that
> > they're in the same ballpark, I'm happy with whatever ends up the
> > simplest to code and debug. ;)
> 
> This does help to shrink the file, e.g. it drops from 28M to 23M.
> 
> It makes it more CPU costly to access a block, as we have to inflate
> that to walk through the records. It also messes with alignment. When
> you touch a block, that may be straddling two virtual memory pages in
> your kernel/filesystem.
> 
> I'm not sure those penalties are worth the additional 16% reduction in size.

Yeah, I don't really care about a 16% reduction in size. I care much
more about simplicity of implementation and debugging. Using zlib is
kind-of simple to implement. But if you've ever had to debug it (or
figure out what is going on with maybe-corrupted output), it's pretty
nasty.

So I don't mind a more readable custom compression if it's not too
complicated. And especially if it buys us extra performance by being
able to jump around non-sequentially in the block.

-Peff


Re: reftable: new ref storage format

2017-07-14 Thread Jeff King
On Thu, Jul 13, 2017 at 05:11:52PM -0700, Shawn Pearce wrote:

> Yes, I was hoping for reader atomicity. But I may be OK foregoing that if
> the transaction either all goes through, or all fails. A partially
> stuck transaction because the process died in the middle of the commit
> step creates a mess for an administrator to undo. Does she rename
> "foo.lock" to "foo"? Or delete "foo.lock"?

Agreed, there's no real rollback or recovery process. I do think
shooting for reader atomicity is worth doing. Lack of atomicity can
cause odd things to happen with operations like pruning. If I'm trying
to get a list of all of the reachable objects, for example, I
might have to readdir() a bunch of directories (let's forget even that a
single readdir() is not necessarily atomic). If I try to atomically move
"refs/heads/z/foo" to "refs/heads/a/foo" there is a reasonable chance
that a reader may see only the deletion and not the addition.

I don't have any known cases of this biting anyone, but it's somewhat
scary.

> > But I'm not sure if you meant to contrast here a system where we didn't
> > use packed-refs at all (though of course such a system is very much not
> > atomic by the definition above).
> 
> No, I really did mean the current system. Gerrit Code Review servers
> create a lot of references throughout the day. It's easy to accumulate
> a few thousand new loose references in a 24 hour period. Even if you
> have 600k existing refs in packed-refs, you still have 2k new/modified
> refs since the last nightly cron ran git gc.

I do think you'd be better served by just calling pack-refs more
frequently, then. Nightly is too infrequent for a busy repo. And under
something like reftables, you'd end up doing the equivalent of a
pack-refs every N updates anyway.

We actually pack refs quite aggressively at GitHub. Way more than I
would consider reasonable, but it's never been a big bottleneck, so I've
never looked into it. We don't do it for every update, but every update
triggers a "consider syncing objects into shared storage" job, which
will pack the refs. So in a hypothetical repo that's constantly updating
we probably pack refs at least once a minute.

But we're generally on low-latency local disks. It sounds like you
emphatically are not.

> > Good goal.  Just to play devil's advocate, the simplest way to do that
> > with the current code would be to gzip packed-refs (and/or store sha1s
> > as binary). That works against the "mmap and binary search" plan,
> > though. :)
> 
> Yes it does. I tried to cover that later under "Alternatives
> considered > bzip". :)

Yeah, sorry, I read and responded to the document a bit out of order. I
agree it's a dead-end. :)

> >> Reference names should be encoded with UTF-8.
> >
> > Don't we usually treat refnames as byte sequences (subject to a few
> > rules, as in check_ref_format())? It seems like the encoding should be
> > out-of-scope for the storage format.
> 
> True that git-core treats them as byte sequences, but JGit treats them as 
> UTF-8.

I think we can probably stick with that, unless the UTF-8ness is really
important? I guess it might matter if it impacts the sorting order.

> Yes. My "near constant time" claim was because I had my head buried
> thinking about disk IO operations when I wrote this, not the algorithm
> that happens on the CPU. One of the applications for reftable is on a
> slower-than-usual storage system, where reading a block of a file
> costs enough milliseconds that it doesn't matter how good or bad the
> CPU algorithm is.

OK, that explains a lot of the decisions better. If you're planning on
evolving this proposal document, I think it would make sense to talk
about that in the objectives section. (I say "if" because I am happy for
the mailing list discussion to serve as a rationale document).

> But you say this yourself below, you can do this using a geometric
> scheme or something to bound the cost of these rewrites such that you
> aren't frequently rewriting the whole database.

Right, but that sounds like math. I wanted you to spoonfeed me the
geometric algorithm (and its bound proof). ;)
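For reference, one common shape for such a scheme (an assumed sketch, not something proposed in the thread) keeps the stack of table sizes geometrically decreasing and merges from the top whenever a newly appended table breaks the ratio; each ref is then rewritten O(log n) times, and the stack depth stays logarithmic in the total size:

```python
FACTOR = 2  # geometric ratio between adjacent tables (an assumed tuning knob)

def add_table(stack, size):
    """stack: list of table sizes, oldest/biggest first.  Append a new
    table, then merge from the top until every table is at least FACTOR
    times bigger than the one above it.  A merge models rewriting two
    adjacent reftables into one file."""
    stack.append(size)
    while len(stack) >= 2 and stack[-2] < FACTOR * stack[-1]:
        top = stack.pop()
        stack[-1] += top
    return stack
```

With FACTOR=2 this behaves like a binary counter: appending 1000 unit-sized tables leaves a stack of at most ~10 tables, never the whole database rewritten on every update.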

> >> Compaction is similar to the update process, but an explicit temporary
> >> file must be used:
> >>
> >> 1. Atomically create `$GIT_DIR/reftable.lock`.
> >> 2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
> >> 3. Compute the update transaction (e.g. compare expected values).
> >> 4. Select files from (2) to collapse as part of this transaction.
> >> 5. Create temp file by `mktemp("$GIT_DIR/.reftableXX")`.
> >> 6. Write modified and collapsed references to temp file.
> >> 7. Rename temp file to `reftable.${n + 1}`.
> >> 8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
> >> 9. Delete `reftable.lock`.
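A literal-minded Python sketch of the quoted compaction steps (error handling elided; `merge_tables` is a hypothetical callback that merges the contents of the selected files into one byte string, and this sketch simply collapses every existing table):

```python
import os
import re
import tempfile

def compact(git_dir, merge_tables):
    lock = os.path.join(git_dir, "reftable.lock")
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)  # step 1
    os.close(fd)
    try:
        # Step 2: readdir() to determine the highest suffix ordinal n.
        ordinals = sorted(
            int(m.group(1))
            for m in (re.match(r"reftable\.(\d+)$", f) for f in os.listdir(git_dir))
            if m
        )
        n = ordinals[-1] if ordinals else 0
        # Step 4 (simplified): collapse all existing tables.
        paths = [os.path.join(git_dir, "reftable.%d" % i) for i in ordinals]
        # Steps 5-6: write merged references to a temp file.
        tmp_fd, tmp = tempfile.mkstemp(prefix=".reftable", dir=git_dir)
        with os.fdopen(tmp_fd, "wb") as f:
            f.write(merge_tables(paths))
        # Step 7: rename temp file to reftable.{n + 1}.
        os.replace(tmp, os.path.join(git_dir, "reftable.%d" % (n + 1)))
        # Step 8: delete the collapsed files.
        for p in paths:
            os.remove(p)
    finally:
        os.remove(lock)  # step 9
```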

I think the "stack" implementation is what makes me most uncomfortable
with this proposal. Atomicity with filesystem operations and especially
readdir() is one of the things I think is most flaky about the 

Re: reftable: new ref storage format

2017-07-14 Thread Shawn Pearce
On Fri, Jul 14, 2017 at 7:27 AM, Dave Borowitz  wrote:
> On Thu, Jul 13, 2017 at 8:11 PM, Shawn Pearce  wrote:
>> In another (Gerrit Code Review), we disable reflogs for
>> the insane refs/changes/ namespace, as nearly every reference is
>> created once, and never modified.
>
> Apologies for the tangent, but this is not true in the most recent
> Gerrit implementation. We update refs/changes/CD/ABCD/1 and
> refs/changes/CD/ABCD/meta in a single BatchRefUpdate, and we set a
> reflog message on the BatchRefUpdate instance, which updates the
> reflog for all refs in the batch. The reflog message on /meta is
> important, and arguably it's useful to be able to correlate that with
> the reflog on /1.
>
> If you think storing reflogs on patch set refs is going to be a
> problem wrt on-disk storage, we should discuss this offline :)

Reflog storage is a problem for Gerrit. It was a problem in early 2009
when servers had far fewer changes. It's going to be even more of a
problem now. Sounds like we have to support reflogs in reftable, or
something like it.


Re: reftable: new ref storage format

2017-07-14 Thread Dave Borowitz
On Thu, Jul 13, 2017 at 8:11 PM, Shawn Pearce  wrote:
> In another (Gerrit Code Review), we disable reflogs for
> the insane refs/changes/ namespace, as nearly every reference is
> created once, and never modified.

Apologies for the tangent, but this is not true in the most recent
Gerrit implementation. We update refs/changes/CD/ABCD/1 and
refs/changes/CD/ABCD/meta in a single BatchRefUpdate, and we set a
reflog message on the BatchRefUpdate instance, which updates the
reflog for all refs in the batch. The reflog message on /meta is
important, and arguably it's useful to be able to correlate that with
the reflog on /1.

If you think storing reflogs on patch set refs is going to be a
problem wrt on-disk storage, we should discuss this offline :)


Re: reftable: new ref storage format

2017-07-13 Thread Shawn Pearce
On Thu, Jul 13, 2017 at 1:35 PM, Jeff King  wrote:
> On Thu, Jul 13, 2017 at 12:56:54PM -0700, Stefan Beller wrote:
>
> I agree that a full binary search of a reftable is harder because of the
> prefix compression (it may still be possible by scanning backwards, but
> I think there are ambiguities when you land in the middle of a record,
> since there's no unambiguous end-of-record character).

It's impossible to safely binary search this reftable format using a
naive divide-the-byte-count-in-half-and-find-a-record-boundary
approach. I actually did design an earlier version of reftable that
was safe to binary search this way within blocks, and wound up
discarding it. It was a slower and more complex implementation than
the format I shared with the list.


> But I don't think
> it matters. If you binary-search to a constant-sized block, then a
> linear scan of the block is acceptable.

Depends on the block size. :)


> Not that I'm recommending just gzipping the whole packed-refs file. It
> ruins the fast-lookup.

As I just mentioned elsewhere in the thread:

  src file  65306185
  gzip      28338906
  reftable  28782292

The reftable format (for 64k block, 256 restart) is within spitting
distance (432 KiB) of a default level gzip of packed-refs. We can get
fast-lookup, and OK compression.


> We _could_ consider gzipping individual blocks of
> a reftable (or any structure that allows you to search to a
> constant-sized block and do a linear search from there). But given that
> they're in the same ballpark, I'm happy with whatever ends up the
> simplest to code and debug. ;)

This does help to shrink the file, e.g. it drops from 28M to 23M.

It makes it more CPU costly to access a block, as we have to inflate
that to walk through the records. It also messes with alignment. When
you touch a block, that may be straddling two virtual memory pages in
your kernel/filesystem.

I'm not sure those penalties are worth the additional 16% reduction in size.


>> When Shawn presented the proposal, a couple of colleagues here
>> were as excited as I was, but the daring question is, why Shawn
>> did not give the whole thing in BNF format from top down:
>>
>>   initial-block
>>   content-blocks*
>>   (index-block)
>>   footer
>
> Yeah, I agree it took me a bit to figure out what was going on. A
> high-level overview of the format would have been nice.

Noted, I've added this to my writeup.


Re: reftable: new ref storage format

2017-07-13 Thread Shawn Pearce
On Thu, Jul 13, 2017 at 12:32 PM, Jeff King  wrote:
> On Wed, Jul 12, 2017 at 05:17:58PM -0700, Shawn Pearce wrote:
>
>> ### Problem statement
>>
>> Some repositories contain a lot of references (e.g.  android at 866k,
>> rails at 31k).  The existing packed-refs format takes up a lot of
>> space (e.g.  62M), and does not scale with additional references.
>> Lookup of a single reference requires linearly scanning the file.
>
> I think the linear scan is actually an implementation short-coming. Even
> though the records aren't fixed-length, the fact that newlines can only
> appear as end-of-record is sufficient to mmap and binary search a
> packed-refs file (you just have to backtrack a little when you land in
> the middle of a record).
>
> I wrote a proof of concept a while ago, but got stuck on integrating it
> into the ref code, because of some of the assumptions that it made.
> Michael Haggerty has been doing several rounds of refactors to remove
> those assumptions. I think we're pretty close (I've actually seen the
> endgame where packed-refs is fully binary searched, but I think there
> are a few more cleanups necessary to cover all cases).

You are correct, this is possible with the current packed-refs format.
It just hasn't materialized in a shipping implementation yet.


>> Atomic pushes modifying multiple references require copying the
>> entire packed-refs file, which can be a considerable amount of data
>> moved (e.g. 62M in, 62M out) for even small transactions (2 refs
>> modified).
>
> I think your definition of atomic here doesn't match what git.git does.

:-(


> Our atomic push just takes the lock on all of the refs, and then once it
> has all of them, commits all of the locks. So it's atomic in the sense
> that you either get all or none of the writes (modulo a commit failure
> in the middle, which we naturally have no rollback plan for). But it can
> be done without touching the packed-refs file at all.
>
> I imagine that you're looking at atomicity from the perspective of a
> reader. In the git.git scheme, the reader may see a half-committed
> transaction. If you dispense with loose refs entirely and treat the
> packed-refs file as a single poorly-implemented key/value database, then
> you get reader atomicity (but O(size_of_database) write performance).

Yes, I was hoping for reader atomicity. But I may be OK foregoing that if
the transaction either all goes through, or all fails. A partially
stuck transaction because the process died in the middle of the commit
step creates a mess for an administrator to undo. Does she rename
"foo.lock" to "foo"? Or delete "foo.lock"?


>> Repositories with many loose references occupy a large number of disk
>> blocks from the local file system, as each reference is its own file
>> storing 41 bytes.  This negatively affects the number of inodes
>> available when a large number of repositories are stored on the same
>> filesystem.  Readers are also penalized due to the larger number of
>> syscalls required to traverse and read the `$GIT_DIR/refs` directory.
>
> In my experience, the syscalls involved in loose refs aren't usually a
> big deal. If you have 800k refs, they're not all changing constantly. So
> a single pack-refs "fixes" performance going forward. What _is_ a big
> deal is that the packing process is complicated, readers have a very
> un-atomic view because of the myriad of files involved, and you get
> annoying lock contention during packing, as well as between deletions
> that have to rewrite packed-refs.
>
> But I'm not sure if you meant to contrast here a system where we didn't
> use packed-refs at all (though of course such a system is very much not
> atomic by the definition above).

No, I really did mean the current system. Gerrit Code Review servers
create a lot of references throughout the day. It's easy to accumulate
a few thousand new loose references in a 24 hour period. Even if you
have 600k existing refs in packed-refs, you still have 2k new/modified
refs since the last nightly cron ran git gc.


>> ### Objectives
>>
>> - Near constant time lookup for any single reference, even when the
>>   repository is cold and not in process or kernel cache.
>
> Good goal, though TBH I'd be happy with O(log n).
>
> A related one is being able to traverse a subset of refs in
> O(nr_traversed). E.g., "git tag -l" should not have to do work
> proportional to what is in refs/changes. That falls out of most
> proposals that allow fast lookups, but notably not a straight
> hash-table.

Thanks, I missed that in this list, even though it was an explicit
objective going into this work. I added:

- Efficient lookup of an entire namespace, such as `refs/tags/`.
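A namespace lookup amounts to a range scan over sorted keys; a toy version over an in-memory sorted list (standing in for seeking to the first key >= the prefix in a reftable and iterating until keys stop matching):

```python
import bisect

def namespace(sorted_refs, prefix):
    """All refs under a namespace, in O(log n + matches) time on a
    sorted list of refnames."""
    lo = bisect.bisect_left(sorted_refs, prefix)
    out = []
    for name in sorted_refs[lo:]:
        if not name.startswith(prefix):
            break  # sorted order means no further matches are possible
        out.append(name)
    return out
```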


>> - Occupy less disk space for large repositories.
>
> Good goal.  Just to play devil's advocate, the simplest way to do that
> with the current code would be to gzip packed-refs (and/or store sha1s
> as binary). That works against the "mmap and binary search" plan,
> though. :)

Yes 

Re: reftable: new ref storage format

2017-07-13 Thread Eric Wong
Jeff King  wrote:
> I agree that a full binary search of a reftable is harder because of the
> prefix compression (it may still be possible by scanning backwards, but
> I think there are ambiguities when you land in the middle of a record,
> since there's no unambiguous end-of-record character). But I don't think
> it matters. If you binary-search to a constant-sized block, then a
> linear scan of the block is acceptable.

For a new packed-refs, I think an intrusive critbit tree would
be a good way to store refs which have many common prefixes and
I've always wanted to apply critbit to an on-disk storage
format...

Several years ago, I started writing one in Perl using
pwrite/pread to provide Message-ID <=> NNTP article number
mapping, but gave up on it out of laziness:

  https://80x24.org/spew/1441508596-19511-1-git-send-emai...@80x24.org/raw

The end goal would've been to have two tries sharing the same
storage struct: one keyed by Message-ID, the other keyed by NNTP
article number (and figuring out the node using offsets like
we do with (container_of|list_entry) in list.h).

For git, being able to do an O(hashlength) prefix search based
on the object_id from the reftable would speed up decorations, I
think.  And of course, the O(refnamelength) prefix search would
also apply to the refnames themselves.
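A plain dict-based trie — a much simpler stand-in than an actual critbit tree — already illustrates the O(key length + matches) prefix search described above:

```python
class RefTrie:
    """Byte-wise trie over refnames; prefix() yields matches in sorted
    order.  A stand-in sketch, not a critbit implementation."""

    def __init__(self):
        self.root = {}  # nested dicts; the None key marks end-of-name

    def insert(self, name, value):
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        node[None] = value

    def prefix(self, pfx):
        # Descend to the node for the prefix, O(len(pfx)).
        node = self.root
        for ch in pfx:
            if ch not in node:
                return
            node = node[ch]
        # Depth-first walk below it; pushing children in reverse sorted
        # order makes the stack pop them in sorted order.
        stack = [(pfx, node)]
        while stack:
            name, node = stack.pop()
            if None in node:
                yield name, node[None]
            for ch in sorted((c for c in node if c is not None), reverse=True):
                stack.append((name + ch, node[ch]))
```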


Re: reftable: new ref storage format

2017-07-13 Thread Jeff King
On Thu, Jul 13, 2017 at 12:56:54PM -0700, Stefan Beller wrote:

> >> ### Problem statement
> >>
> >> Some repositories contain a lot of references (e.g.  android at 866k,
> >> rails at 31k).  The existing packed-refs format takes up a lot of
> >> space (e.g.  62M), and does not scale with additional references.
> >> Lookup of a single reference requires linearly scanning the file.
> >
> > I think the linear scan is actually an implementation short-coming. Even
> > though the records aren't fixed-length, the fact that newlines can only
> > appear as end-of-record is sufficient to mmap and binary search a
> > packed-refs file (you just have to backtrack a little when you land in
> > the middle of a record).
> 
> Except that a record is a "delta" to the previous record, so it's not
> just finding a record, but reconstructing it. Example for records:

I was still talking about the existing packed-refs implementation here.

I agree that a full binary search of a reftable is harder because of the
prefix compression (it may still be possible by scanning backwards, but
I think there are ambiguities when you land in the middle of a record,
since there's no unambiguous end-of-record character). But I don't think
it matters. If you binary-search to a constant-sized block, then a
linear scan of the block is acceptable.
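A toy version of that two-level lookup, with in-memory lists standing in for on-disk blocks and their index:

```python
import bisect

def lookup(index, blocks, key):
    """Binary-search the index for the one block that could hold `key`,
    then linearly scan just that block.  index[i] is the first key
    stored in blocks[i]."""
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None  # key sorts before the first block
    for k, v in blocks[i]:
        if k == key:
            return v
        if k > key:
            break  # blocks are sorted, so the key cannot appear later
    return None
```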

> >> - Occupy less disk space for large repositories.
> >
> > Good goal.  Just to play devil's advocate, the simplest way to do that
> > with the current code would be to gzip packed-refs (and/or store sha1s
> > as binary). That works against the "mmap and binary search" plan,
> > though. :)
> 
> Given the compression by delta-ing the name to the previous change and
> the fact that Gerrit has
> 
>   refs/heads/changes/1
>   refs/heads/changes/2
>   refs/heads/changes/3
>   ...
> 
> I think this format would trump a "dumb" zip.
> (Github having sequentially numbered pull requests would also
> benefit here)

You may be surprised. Let's imagine that you have a set of 4096 refs in
refs/changes/1, refs/changes/2, etc:

  for i in $(seq 1 4096)
  do
echo refs/changes/$i
  done >input

Now let's do a prefix compression, with a single byte for "how many
characters to reuse from the last entry":

  perl -lne '
my $common;
if (defined $last) {
  chop $last while !/\Q$last\E/;
  $common = length($last);
} else {
  $common = 0;
}
print chr($common), substr($_, $common);
$last = $_;
  ' <input >prefix

And a gzip:

  gzip -c -9 <input >zip

And the results:

  $ wc -c prefix; wc -c zip
  12754 prefix
  10116 zip

The secret sauce is most likely that gzip is bit-packing, using only a
few bits per character and not aligning with byte boundaries.

Not that I'm recommending just gzipping the whole packed-refs file. It
ruins the fast-lookup. We _could_ consider gzipping individual blocks of
a reftable (or any structure that allows you to search to a
constant-sized block and do a linear search from there). But given that
they're in the same ballpark, I'm happy with whatever ends up the
simplest to code and debug. ;)

Just for fun, here's the decoding script for the prefix-compression:

  perl -e '
while (read(STDIN, $common, 1)) {
  $common = ord($common);
  $rest = <STDIN>;
  if ($common > 0) {
$rest = substr($last, 0, $common) . $rest
  }
  print $rest;
  $last = $rest}' <prefix

> > OK, let me try to summarize to see if I understand.
> 
> When Shawn presented the proposal, a couple of colleagues here
> were as excited as I was, but the daring question is, why Shawn
> did not give the whole thing in BNF format from top down:
> 
>   initial-block
>   content-blocks*
>   (index-block)
>   footer

Yeah, I agree it took me a bit to figure out what was going on. A
high-level overview of the format would have been nice.

> >  So lookup really is more
> > like O(block_size * log(n/block_size)), but block_size being a constant,
> > it drops out to O(log n).
> 
> There is also an index block such that you can binary search across
> blocks, so
> 
> O( log(block_count) + log(intra_block_restarting_points) + small linear scan)
> 
> There are 2 binary searches, and the block size is an interesting
> thing to look at when making up trade offs.

Right, the cross-block index was what I was trying to account for.
Either way, from a big-O perspective the block size and the number of
restarts are constants with respect to the total number of entries. I'm
happy with log(n), though. It's hard to do better.

-Peff


Re: reftable: new ref storage format

2017-07-13 Thread Stefan Beller
On Thu, Jul 13, 2017 at 12:32 PM, Jeff King  wrote:
> On Wed, Jul 12, 2017 at 05:17:58PM -0700, Shawn Pearce wrote:
>
>> We've been having scaling problems with insane number of references
>> (>866k), so I started thinking a lot about improving ref storage.
>>
>> I've written a simple approach, and implemented it in JGit.
>> Performance is promising:
>>
>>   - 62M packed-refs compresses to 27M
>>   - 42.3 usec lookup
>
> Exciting. I'd love for us to have a simple-ish on-disk structure that
> scales well and doesn't involve a dependency on a third-party database
> structure.
>
> Let me see what holes I can poke in your proposal, though. :)
>
>> ### Problem statement
>>
>> Some repositories contain a lot of references (e.g.  android at 866k,
>> rails at 31k).  The existing packed-refs format takes up a lot of
>> space (e.g.  62M), and does not scale with additional references.
>> Lookup of a single reference requires linearly scanning the file.
>
> I think the linear scan is actually an implementation short-coming. Even
> though the records aren't fixed-length, the fact that newlines can only
> appear as end-of-record is sufficient to mmap and binary search a
> packed-refs file (you just have to backtrack a little when you land in
> the middle of a record).

Except that a record is a "delta" to the previous record, so it's not
just finding a record, but reconstructing it. Example for records:

varint( prefix_length )
varint( (suffix_length << 2) | type )
suffix
value?

First record:

 0,
 16 << 2 | 0x01,
  "refs/heads/maint"
  08f9c32463bf9e578acb7ac5f77afd36e803c6bc

next record (refs/heads/master):

  13
  4 << 2 | 0x01
  "ster",
  80145b1e412719c960036c8c62a9e35dd23a5b2d

Now if you found the second one, you cannot reconstruct its
real name (refs/heads/master) without knowing the name
of the first. The name of the first is easy because the prefix_length
is 0. If it also had a prefix length != 0 you'd have to go back more.
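A sketch of this record encoding, showing the sequential decode it forces (LSB-first varints and a fixed 20-byte binary value are assumptions of the sketch; the thread's example uses value_type 0x01):

```python
def varint(n):
    """LSB-first varint, 7 bits per byte (an assumed encoding)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def read_varint(buf, i):
    n = shift = 0
    while True:
        b = buf[i]
        i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i
        shift += 7

def common_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def encode(refs):
    """refs: sorted (name, 20-byte value) pairs; value_type fixed at 0x01."""
    out, last = bytearray(), b""
    for name, value in refs:
        p = common_prefix(last, name)
        suffix = name[p:]
        out += varint(p)                        # prefix_length
        out += varint(len(suffix) << 2 | 0x01)  # (suffix_length << 2) | type
        out += suffix + value
        last = name
    return bytes(out)

def decode(buf):
    """Decoding record k needs record k-1's full name: scans are sequential."""
    refs, last, i = [], b"", 0
    while i < len(buf):
        p, i = read_varint(buf, i)
        word, i = read_varint(buf, i)
        slen = word >> 2
        name = last[:p] + buf[i:i + slen]
        i += slen
        value = buf[i:i + 20]
        i += 20
        refs.append((name, value))
        last = name
    return refs
```

In the maint/master example above, the second record stores only "ster"; without the first record's name it cannot be reconstructed.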


>> - Occupy less disk space for large repositories.
>
> Good goal.  Just to play devil's advocate, the simplest way to do that
> with the current code would be to gzip packed-refs (and/or store sha1s
> as binary). That works against the "mmap and binary search" plan,
> though. :)

Given the compression by delta-ing the name to the previous change and
the fact that Gerrit has

  refs/heads/changes/1
  refs/heads/changes/2
  refs/heads/changes/3
  ...

I think this format would trump a "dumb" zip.
(Github having sequentially numbered pull requests would also
benefit here)

>> ## File format
>
> OK, let me try to summarize to see if I understand.

When Shawn presented the proposal, a couple of colleagues here
were as excited as I was, but the daring question is, why Shawn
did not give the whole thing in BNF format from top down:

  initial-block
  content-blocks*
  (index-block)
  footer

> The reftable file is a sequence of blocks, each of which contains a
> finite set of heavily-compressed refs. You have to read each block
> sequentially,

Each block may have restarting points, that allow for intra-block
binary search.

> but since they're a fixed size, that's still a
> constant-time operation (I'm ignoring the "restarts" thing for now). You
> find the right block by reading the index.

or by reading the footer at the end. If the footer and the index differ
in block size (one bit flipped), we can ask the CRC of the footer
for more guidance.

>  So lookup really is more
> like O(block_size * log(n/block_size)), but block_size being a constant,
> it drops out to O(log n).

There is also an index block such that you can binary search across
blocks, so

O( log(block_count) + log(intra_block_restarting_points) + small linear scan)

There are 2 binary searches, and the block size is an interesting
thing to look at when making up trade offs.


Re: reftable: new ref storage format

2017-07-13 Thread Jeff King
On Wed, Jul 12, 2017 at 05:17:58PM -0700, Shawn Pearce wrote:

> We've been having scaling problems with insane number of references
> (>866k), so I started thinking a lot about improving ref storage.
> 
> I've written a simple approach, and implemented it in JGit.
> Performance is promising:
> 
>   - 62M packed-refs compresses to 27M
>   - 42.3 usec lookup

Exciting. I'd love for us to have a simple-ish on-disk structure that
scales well and doesn't involve a dependency on a third-party database
structure.

Let me see what holes I can poke in your proposal, though. :)

> ### Problem statement
> 
> Some repositories contain a lot of references (e.g.  android at 866k,
> rails at 31k).  The existing packed-refs format takes up a lot of
> space (e.g.  62M), and does not scale with additional references.
> Lookup of a single reference requires linearly scanning the file.

I think the linear scan is actually an implementation short-coming. Even
though the records aren't fixed-length, the fact that newlines can only
appear as end-of-record is sufficient to mmap and binary search a
packed-refs file (you just have to backtrack a little when you land in
the middle of a record).
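The backtracking trick is simple to sketch. The following Python operates on an in-memory bytes object standing in for the mmap'd file, and assumes each record is `sha refname\n` with a trailing newline; the helper name is made up for illustration:

```python
def find_ref(data: bytes, want: bytes):
    """Binary search newline-terminated 'sha refname\n' records without
    parsing the whole file: probe an offset, back up to the start of the
    record containing it, and compare that record's name."""
    lo, hi = 0, len(data)   # lo is always a record start
    while lo < hi:
        mid = (lo + hi) // 2
        # Backtrack: the record containing `mid` starts just after the
        # previous newline (or at lo if none lies in between).
        nl = data.rfind(b"\n", lo, mid)
        start = nl + 1 if nl != -1 else lo
        end = data.find(b"\n", start)
        sha, name = data[start:end].split(b" ", 1)
        if name == want:
            return sha.decode()
        if name < want:
            lo = end + 1
        else:
            hi = start
    return None
```

Since `start <= mid <= end`, each iteration shrinks the search window, so the loop terminates even though records are variable-length.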

I wrote a proof of concept a while ago, but got stuck on integrating it
into the ref code, because of some of the assumptions that it made.
Michael Haggerty has been doing several rounds of refactors to remove
those assumptions. I think we're pretty close (I've actually seen the
endgame where packed-refs is fully binary searched, but I think there
are a few more cleanups necessary to cover all cases).

> Atomic pushes modifying multiple references require copying the
> entire packed-refs file, which can be a considerable amount of data
> moved (e.g. 62M in, 62M out) for even small transactions (2 refs
> modified).

I think your definition of atomic here doesn't match what git.git does.

Our atomic push just takes the lock on all of the refs, and then once it
has all of them, commits all of the locks. So it's atomic in the sense
that you either get all or none of the writes (modulo a commit failure
in the middle, which we naturally have no rollback plan for). But it can
be done without touching the packed-refs file at all.
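A toy sketch of that lock-all-then-commit scheme, using flat files named `<ref>.lock` (a simplification of what git.git actually does with loose refs, for illustration only):

```python
import os

def atomic_update(refs_dir, updates):
    """All-or-none ref update: create every <ref>.lock exclusively first;
    only once all locks are held, commit each by renaming it over the
    ref. Readers get no atomicity: they can observe a state where only
    some of the renames have happened."""
    locks = []
    try:
        for name, value in updates.items():
            lock = os.path.join(refs_dir, name + ".lock")
            # O_EXCL makes creation fail if someone else holds the lock.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, (value + "\n").encode())
            os.close(fd)
            locks.append((lock, os.path.join(refs_dir, name)))
    except FileExistsError:
        # Lost the race on some ref: drop every lock, commit nothing.
        for lock, _ in locks:
            os.unlink(lock)
        return False
    for lock, ref in locks:
        os.replace(lock, ref)  # point of no return; no rollback from here
    return True
```

The second loop is the "commit failure in the middle" window mentioned above: a crash there leaves some refs updated and others not.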

I imagine that you're looking at atomicity from the perspective of a
reader. In the git.git scheme, the reader may see a half-committed
transaction. If you dispense with loose refs entirely and treat the
packed-refs file as a single poorly-implemented key/value database, then
you get reader atomicity (but O(size_of_database) write performance).

> Repositories with many loose references occupy a large number of disk
> blocks from the local file system, as each reference is its own file
> storing 41 bytes.  This negatively affects the number of inodes
> available when a large number of repositories are stored on the same
> filesystem.  Readers are also penalized due to the larger number of
> syscalls required to traverse and read the `$GIT_DIR/refs` directory.

In my experience, the syscalls involved in loose refs aren't usually a
big deal. If you have 800k refs, they're not all changing constantly. So
a single pack-refs "fixes" performance going forward. What _is_ a big
deal is that the packing process is complicated, readers have a very
un-atomic view because of the myriad of files involved, and you get
annoying lock contention during packing, as well as between deletions
that have to rewrite packed-refs.

But I'm not sure if you meant to contrast here a system where we didn't
use packed-refs at all (though of course such a system is very much not
atomic by the definition above).


> ### Objectives
> 
> - Near constant time lookup for any single reference, even when the
>   repository is cold and not in process or kernel cache.

Good goal, though TBH I'd be happy with O(log n).

A related one is being able to traverse a subset of refs in
O(nr_traversed). E.g., "git tag -l" should not have to do work
proportional to what is in refs/changes. That falls out of most
proposals that allow fast lookups, but notably not a straight
hash-table.
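Sorted storage gives that property almost for free: seek to the first name at or after the prefix, then scan until the prefix stops matching. A small Python sketch (names are illustrative):

```python
import bisect

def scan_prefix(sorted_names, prefix):
    """Range scan over sorted ref names: O(log n) to seek to the first
    name >= prefix, then O(nr_traversed) to yield matches, independent
    of how many refs live under other prefixes like refs/changes."""
    i = bisect.bisect_left(sorted_names, prefix)
    while i < len(sorted_names) and sorted_names[i].startswith(prefix):
        yield sorted_names[i]
        i += 1
```

A hash table cannot support this without enumerating every entry, which is why it falls short of this goal.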

> - Occupy less disk space for large repositories.

Good goal.  Just to play devil's advocate, the simplest way to do that
with the current code would be to gzip packed-refs (and/or store sha1s
as binary). That works against the "mmap and binary search" plan,
though. :)

> - Support atomic pushes with lower copying penalties.

"Lower" is vague. I'd hope we could do updates with effort linear to the
number of updated refs (it's OK if there's a constant factor, like
writing extra blocks, as long as a single-ref update is about as
expensive in a 10-ref repo as in a 10K-ref repo).

> Scan (read 866k refs) and lookup (single ref from 866k refs):
> 
format      | scan    | lookup
------------|--------:|--------------:
> packed-refs |  380 ms | 375420.0 usec
> reftable|  125 ms | 42.3 usec

Don't forget in git.git that the memory usage for packed-refs is
atrocious (because we parse 

reftable: new ref storage format

2017-07-12 Thread Shawn Pearce
We've been having scaling problems with insane number of references
(>866k), so I started thinking a lot about improving ref storage.

I've written a simple approach, and implemented it in JGit.
Performance is promising:

  - 62M packed-refs compresses to 27M
  - 42.3 usec lookup

You can read a rendered version of this here:
https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md


## Overview

### Problem statement

Some repositories contain a lot of references (e.g.  android at 866k,
rails at 31k).  The existing packed-refs format takes up a lot of
space (e.g.  62M), and does not scale with additional references.
Lookup of a single reference requires linearly scanning the file.

Atomic pushes modifying multiple references require copying the
entire packed-refs file, which can be a considerable amount of data
moved (e.g. 62M in, 62M out) for even small transactions (2 refs
modified).

Repositories with many loose references occupy a large number of disk
blocks from the local file system, as each reference is its own file
storing 41 bytes.  This negatively affects the number of inodes
available when a large number of repositories are stored on the same
filesystem.  Readers are also penalized due to the larger number of
syscalls required to traverse and read the `$GIT_DIR/refs` directory.

### Objectives

- Near constant time lookup for any single reference, even when the
  repository is cold and not in process or kernel cache.
- Occupy less disk space for large repositories.
- Support atomic pushes with lower copying penalties.

### Description

A reftable file is a portable binary file format customized for
reference storage. References are sorted, enabling linear scans,
binary search lookup, and range scans.

Storage in the file is organized into blocks.  Prefix compression
is used within a single block to reduce disk space.  Block size is
tunable by the writer.
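The idea behind prefix compression can be sketched in a few lines: each name is stored as the length of the prefix it shares with the previous name, plus the remaining suffix. This is a simplified illustration, not the exact reftable record encoding:

```python
def compress(names):
    """Encode sorted names as (common_prefix_len, suffix) pairs."""
    out, prev = [], ""
    for name in names:
        n = 0
        while n < min(len(prev), len(name)) and prev[n] == name[n]:
            n += 1
        out.append((n, name[n:]))
        prev = name
    return out

def decompress(pairs):
    """Rebuild names by splicing each suffix onto the previous name."""
    out, prev = [], ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        out.append(prev)
    return out
```

Because refs share long prefixes like `refs/heads/` or `refs/changes/`, most records need only a couple of bytes of suffix, which is where the 33-bytes-per-ref average comes from.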

### Performance

Space used, packed-refs vs. reftable:

repository | packed-refs | reftable | % original | avg ref
-----------|------------:|---------:|-----------:|--------:
android|  62.2 M |   27.7 M | 44.4%  | 33 bytes
rails  |   1.8 M |  896.2 K | 47.6%  | 29 bytes
git|  78.7 K |   27.9 K | 40.0%  | 43 bytes
git (heads)|   332 b |204 b | 61.4%  | 34 bytes

Scan (read 866k refs) and lookup (single ref from 866k refs):

format      | scan    | lookup
------------|--------:|--------------:
packed-refs |  380 ms | 375420.0 usec
reftable|  125 ms | 42.3 usec

## Details

### Peeling

References in a reftable are always peeled.

### Reference name encoding

Reference names should be encoded with UTF-8.

### Ordering

Blocks are lexicographically ordered by their first reference.


## File format

### Header

An 8-byte header appears at the beginning of each file:

- 4-byte magic is: `'\1', 'R', 'E', 'F'`
- 1-byte version number, `1`.
- 3-byte `block_size` in bytes (network byte order).
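A reader can unpack those three fields directly; a minimal Python sketch (assuming the `'\1', 'R', 'E', 'F'` magic above):

```python
def parse_header(data: bytes):
    """Parse the 8-byte reftable header: 4-byte magic, 1-byte version,
    3-byte big-endian (network byte order) block_size."""
    if len(data) < 8 or data[:4] != b"\x01REF":
        raise ValueError("not a reftable file")
    version = data[4]
    block_size = int.from_bytes(data[5:8], "big")
    if version != 1:
        raise ValueError("unsupported reftable version %d" % version)
    return version, block_size
```

The 3-byte field caps `block_size` at 16 MiB, far above any plausible block size.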

### Block size

The `block_size` is arbitrarily determined by the writer, and does not
have to be a power of 2.  The block size must be larger than the
longest reference name used in the repository, as references cannot
span blocks.

### First block

The first block shares space with the file header, and is therefore 8
bytes smaller than all other blocks in the file.  It begins
immediately after the file header, at offset 8.

### Block format

A block is written as:

ref_record*
padding?
int32( restart_offset )*
int32( record_end_offset )
int32( number_of_restarts )

Blocks begin with a variable number of `ref_record` entries describing
reference names and values; the record format is described below.

The middle of the block may be filled with `padding` NUL bytes to
fill out the block to the common `block_size` as specified in the file
header.  Padding may be necessary to ensure `number_of_restarts`
occupies the last 4 bytes of the block.  Padding may be omitted if the
block is the last block of the file, and there is no index block.
This allows reftable to efficiently scale down to a small number of
refs.

A variable number of 4-byte, network byte order `restart_offset`
values follows the padding.  Offsets are relative to the start of the
block and refer to the first byte of any `ref_record` whose name has
not been prefix compressed.  Readers can start linear scans from any
of these records.

The 4-byte, network byte order `record_end_offset` follows, providing
the block-relative offset just past the end of the last `ref_record`.
If `padding` is present, this is the offset of the first byte of
padding; otherwise it is the offset of the first byte of the first
`restart_offset` entry.

The 4-byte, network byte order `number_of_restarts` stores the number
of entries in the `restart_offset` list.  Readers can use the restart
count to binary search between restarts before starting a linear scan.
This field must be the last 4 bytes of the block; the `padding` field
must be used to
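Putting the trailer fields together, a reader can recover the restart list by walking backwards from the end of the block. A Python sketch over an in-memory block, following the struct listing above (restart offsets, then `record_end_offset`, then `number_of_restarts` as the last 4 bytes):

```python
def block_trailer(block: bytes):
    """Parse a block trailer: number_of_restarts is the last 4 bytes,
    record_end_offset the 4 bytes before it, and the restart_offset
    list of 4-byte big-endian values sits immediately before that."""
    n = int.from_bytes(block[-4:], "big")
    record_end = int.from_bytes(block[-8:-4], "big")
    base = len(block) - 8 - 4 * n
    restarts = [
        int.from_bytes(block[base + 4 * i: base + 4 * i + 4], "big")
        for i in range(n)
    ]
    return record_end, restarts
```

With `record_end` and `restarts` in hand, a reader can binary search the restart points and linearly scan prefix-compressed records up to `record_end`.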