Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

2018-01-08 Thread Jonathan Tan
On Mon, 8 Jan 2018 15:35:59 -0500
Derrick Stolee  wrote:

> Thanks! That is certainly the idea. If you know about MIDX, then you can 
> benefit from it. If you do not, then you have all the same data 
> available to you do to your work. Having a MIDX file will not break 
> other tools (libgit2, JGit, etc.).
> 
> One thing I'd like to determine before this patch goes to v1 is how much 
> we should make the other packfile-aware commands also midx-aware. My gut 
> reaction right now is to have git-repack call 'git midx --clear' if 
> core.midx=true and a packfile was deleted. However, this could easily be 
> changed with 'git midx --clear' followed by 'git midx --write 
> --update-head' if midx-head exists.

My opinion is that these are sufficient:
 (a) functionality to create a .midx file from scratch (deleting any
 existing ones)
 (b) either:
 - update of packfile.c to read (one or more) midx files (like in
   patch 18), and possibly usage in a benchmark, or
 - any other usage of midx file (e.g. abbreviations, like in patch
   17)

In general, a way to create them (so that they can be used from a
cronjob, etc.), and a way to consume them to show that the new way works
and is indeed faster. This functionality in itself might be sufficient
to merge in - this would already be useful in situations where it is
undesirable for repacks to happen often, and we can "bridge" the
intervals between repacks using a cronjob that periodically generates
.midx files in order to keep up the object lookup performance.

In particular, I think that it is fine to omit a more sophisticated
"repack" for now - the .midx mechanism must tolerate packs referenced by
.midx files suddenly disappearing anyway, and in this way, at least we
can demonstrate that the .midx mechanism still works in this case.


Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

2018-01-08 Thread Derrick Stolee

On 1/8/2018 2:32 PM, Jonathan Tan wrote:

On Sun,  7 Jan 2018 13:14:42 -0500
Derrick Stolee  wrote:


+Design Details
+--
+
+- The MIDX file refers only to packfiles in the same directory
+  as the MIDX file.
+
+- A special file, 'midx-head', stores the hash of the latest
+  MIDX file so we can load the file without performing a dirstat.
+  This file is especially important with incremental MIDX files,
+  pointing to the newest file.

I presume that the actual MIDX files are named by hash? (You might have
written this somewhere that I somehow missed.)

Also, I notice that in the "Future Work" section, the possibility of
multiple MIDX files is raised. Could this 'midx-head' file be allowed to
store multiple such files? That way, we avoid a bit of file format
churn (in that we won't need to define a new "chunk" in the future).


I hadn't considered this idea, and I like it. I'm not sure this is a 
robust solution, since isolated MIDX files don't contain information 
that they could use other MIDX files, or what order they should be in. I 
think the "order" of incremental MIDX files is important in a few ways 
(such as the "stable object order" idea).


I will revisit this idea when I come back with the incremental MIDX 
feature. For now, the only reference to "number of base MIDX files" is 
in one byte of the MIDX header. We should consider changing that byte 
for this patch.



+- If a packfile exists in the pack directory but is not referenced
+  by the MIDX file, then the packfile is loaded into the packed_git
+  list and Git can access the objects as usual. This behavior is
+  necessary since other tools could add packfiles to the pack
+  directory without notifying Git.
+
+- The MIDX file should be only a supplemental structure. If a
+  user downgrades or disables the `core.midx` config setting,
+  then the existing .idx and .pack files should be sufficient
+  to operate correctly.

Let me try to summarize: so, at this point, there are no
backwards-incompatible changes to the repo disk format. Unupdated code
paths (and old versions of Git) can just read the .idx and .pack files,
as always. Updated code paths will look at the .midx and .idx files, and
will sort them as follows:
  - .midx files go into a data structure
  - .idx files not referenced by any .midx files go into the
existing packed_git data structure

A writer can either merely write a new packfile (like old versions of
Git) or write a packfile and update the .midx file, and everything above
will still work. In the event that a writer deletes an existing packfile
referenced by a .midx (for example, old versions of Git during a
repack), we will lose the advantages of the .midx file - we will detect
that the .midx no longer works when attempting to read an object given
its information, but in this case, we can recover by dropping the .midx
file and loading all the .idx files it references that still exist.

As a reviewer, I think this is a very good approach, and this does make
things easier to review (as opposed to, say, an approach where a lot of
the code must be aware of .midx files).


Thanks! That is certainly the idea. If you know about MIDX, then you can 
benefit from it. If you do not, then you have all the same data 
available to you do to your work. Having a MIDX file will not break 
other tools (libgit2, JGit, etc.).


One thing I'd like to determine before this patch goes to v1 is how much 
we should make the other packfile-aware commands also midx-aware. My gut 
reaction right now is to have git-repack call 'git midx --clear' if 
core.midx=true and a packfile was deleted. However, this could easily be 
changed with 'git midx --clear' followed by 'git midx --write 
--update-head' if midx-head exists.


Thanks,
-Stolee


Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

2018-01-08 Thread Jonathan Tan
On Sun,  7 Jan 2018 13:14:42 -0500
Derrick Stolee  wrote:

> +Design Details
> +--
> +
> +- The MIDX file refers only to packfiles in the same directory
> +  as the MIDX file.
> +
> +- A special file, 'midx-head', stores the hash of the latest
> +  MIDX file so we can load the file without performing a dirstat.
> +  This file is especially important with incremental MIDX files,
> +  pointing to the newest file.

I presume that the actual MIDX files are named by hash? (You might have
written this somewhere that I somehow missed.)

Also, I notice that in the "Future Work" section, the possibility of
multiple MIDX files is raised. Could this 'midx-head' file be allowed to
store multiple such files? That way, we avoid a bit of file format
churn (in that we won't need to define a new "chunk" in the future).

> +- If a packfile exists in the pack directory but is not referenced
> +  by the MIDX file, then the packfile is loaded into the packed_git
> +  list and Git can access the objects as usual. This behavior is
> +  necessary since other tools could add packfiles to the pack
> +  directory without notifying Git.
> +
> +- The MIDX file should be only a supplemental structure. If a
> +  user downgrades or disables the `core.midx` config setting,
> +  then the existing .idx and .pack files should be sufficient
> +  to operate correctly.

Let me try to summarize: so, at this point, there are no
backwards-incompatible changes to the repo disk format. Unupdated code
paths (and old versions of Git) can just read the .idx and .pack files,
as always. Updated code paths will look at the .midx and .idx files, and
will sort them as follows:
 - .midx files go into a data structure
 - .idx files not referenced by any .midx files go into the
   existing packed_git data structure

A writer can either merely write a new packfile (like old versions of
Git) or write a packfile and update the .midx file, and everything above
will still work. In the event that a writer deletes an existing packfile
referenced by a .midx (for example, old versions of Git during a
repack), we will lose the advantages of the .midx file - we will detect
that the .midx no longer works when attempting to read an object given
its information, but in this case, we can recover by dropping the .midx
file and loading all the .idx files it references that still exist.

As a reviewer, I think this is a very good approach, and this does make
things easier to review (as opposed to, say, an approach where a lot of
the code must be aware of .midx files).


[RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

2018-01-07 Thread Derrick Stolee
Commentary: This file format uses the large offsets from the pack-index
version 2 format, but drops the CRC32 hashes from that format.

Also: I included the HASH footer at the end only because it is already in
the pack and pack-index formats, but not because it is particularly useful
here. If possible, I'd like to remove it and speed up MIDX writes.

-- >8 --

Add a technical documentation file describing the design
for the multi-pack index (MIDX). Includes current limitations
and future work.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/multi-pack-index.txt | 149 +++
 1 file changed, 149 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt 
b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 00..d31b03dec5
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,149 @@
+Multi-Pack-Index (MIDX) Design Notes
+
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short, with suffix ".midx")
+stores a list of objects and their offsets into multiple pack-
+files. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+A new config setting 'core.midx' must be enabled before writing
+or reading MIDX files.
+
+The MIDX files are updated by the 'midx' builtin with the
+following common parameter combinations:
+
+- 'git midx' gives the hash of the current MIDX head.
+- 'git midx --write --update-head --delete-expired' writes a new
+  MIDX file, points the MIDX head to that file, and deletes the
+  existing MIDX file if out-of-date.
+- 'git midx --read' lists some basic information about the current
+  MIDX head. Used for basic tests.
+- 'git midx --clear' deletes the current MIDX head.
+
+Design Details
+--
+
+- The MIDX file refers only to packfiles in the same directory
+  as the MIDX file.
+
+- A special file, 'midx-head', stores the hash of the latest
+  MIDX file so we can load the file without performing a dirstat.
+  This file is especially important with incremental MIDX files,
+  pointing to the newest file.
+
+- If a packfile exists in the pack directory but is not referenced
+  by the MIDX file, then the packfile is loaded into the packed_git
+  list and Git can access the objects as usual. This behavior is
+  necessary since other tools could add packfiles to the pack
+  directory without notifying Git.
+
+- The MIDX file should be only a supplemental structure. If a
+  user downgrades or disables the `core.midx` config setting,
+  then the existing .idx and .pack files should be sufficient
+  to operate correctly.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+- If an object appears in multiple packfiles, then only one copy
+  is stored in the MIDX. This has a possible performance issue:
+  If an object appears as the delta-base of multiple objects from
+  multiple packs, then cross-pack delta calculations may slow down.
+  This is currently only theoretical and has not been demonstrated
+  to be a measurable issue.
+
+Current Limitations
+---
+
+- MIDX files are managed only by the midx builtin and is not
+  automatically updated on clone or fetch.
+
+- There is no '--verify' option for the midx builtin to verify
+  the contents of the MIDX file against the pack contents.
+
+- Constructing a MIDX file currently requires the single-pack
+  index for every pack being added to the MIDX.
+
+- The fsck builtin does not check MIDX files, but should.
+
+- The repack builtin is not aware of the MIDX files, and may
+  invalidate the MIDX files by deleting existing packfiles. The
+  MIDX may also be extended in the future to store metadata about
+  a packfile that ca