subject:"Git pack v4\: next step, help required"

Re: Git pack v4: next step, help required

2013-09-18 Thread Duy Nguyen

On Thu, Sep 19, 2013 at 12:40 AM, Nicolas Pitre  wrote:
>
> I think the pack v4 format and compatibility code is pretty stable now,
> and probably as optimal as it can reasonably be.
>
> @Junio: might be a good idea to refresh your pu branch again.
>
> Now the biggest problem to solve is the actual tree
> "deltification".  I don't have the time to spend on this otherwise very
> interesting problem (IMHO at least) so I'm sending this request for help
> in the hope that more people would be keen to contribute their computing
> skills to solve this challenge.

Just so we're clear as I'm involved a bit in packv4 series. I plan to
make the test suite fully pass, then add v4 support to git protocol.
After that I may look into adding multi-base tree support to
index-pack/unpack-objects, and only then either look into this
challenge or continue to add v4-aware tree walker to git. In short,
you you are intereted in Nico's challenge, go ahead, it will not
overlap with my work, at least in the next one or two months.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Git pack v4: next step, help required

2013-09-18 Thread Nicolas Pitre


I think the pack v4 format and compatibility code is pretty stable now, 
and probably as optimal as it can reasonably be.

@Junio: might be a good idea to refresh your pu branch again.

Now the biggest problem to solve is the actual tree 
"deltification".  I don't have the time to spend on this otherwise very 
interesting problem (IMHO at least) so I'm sending this request for help 
in the hope that more people would be keen to contribute their computing 
skills to solve this challenge.

I'll make a quick description of the pv4 tree encoding first and explain 
the problem to solve afterwards.

A pv4 tree is comprised of two types of records: immediate tree entry 
and tree sequence copy.

1) Immediate Tree Entry

This is made of the following tuple:

,{SHA1 reference index>

The path component index refers to the path dictionary table where path 
strings and file modes are uniquely stored.  The SHA1 index refers to 
the sorted SHA1 table as previously found in the pack index but which is 
now part of the pack file itself.  So on average a single tree entry may 
take between 1 to 2 bytes for the path component index and 1 to 3 bytes 
for the SHA1 index.

2) Tree Sequence Copy

This is used to literally copy a contiguous list of tree entries from 
another tree object.  This goes as follows:

,
[,]

So instead of having arbitrary copying of data like in delta objects, we 
refer directly to tree entries in another object.  The big advantage 
here is that the tree walker may directly jump to the copied object 
without having to do any delta patching and caching. The SHA1 index is 
optional if it refers to the same copied object as the previous tree 
sequence copy record in this tree object.  And another possible 
optimization for the tree walker when enumerating objects is to skip the 
copy entry entirely if the object being copied from has already been 
enumerated.

The size of this entry is more variable and is typically between 1 to 2 
bytes for the start index, 1 to 2 bytes for the copy size, and 0 to 3 
bytes for the SHA1 index.

So... what to do with this?

Currently the "deltification" is achieved using a pretty dumb heuristic 
which is to simply take each tree object in a pack v2 with their 
corresponding base delta object and perform a straight conversion into 
pack v4 tree format i.e. use copy records whenever possible as long as 
they represent a space saving over the corresponding immediate tree 
entries (which is normally the case).

However this is rather sub-optimal for two reasons:

1) This doesn't benefit from the ability to use multiple base objects to 
   copy from and is potentially missing on additional opportunities for 
   copy sequences which are not possible in the selected base object 
   from the pack v2 delta base selection. Pack v4 is already quite 
   smaller than pack v2 yet it might possibly be smaller still.

2) This makes deep delta chains very inefficient at runtime when the 
   pack is subsequently accessed.

Let's consider this example to fully illustrate #2.

Tree A:
entry 0:"foo.txt"
entry 1-3:  copy from tree B: start=2 count=3

Tree B:
entry 0-5:  copy from tree C: start=2 count=5
entry 6:"bar.txt"

Tree C:
entry 0:"file_a"
entry 1:"file_b"
entry 2:"file_c"
entry 3:"file_D"
entry 4:"file_E"
entry 5:"file_F"
entry 6:"file_G"

This is a good example of what typically happens when the above heuristic 
is applied.  And it is not uncommon to see a long indirection chain of 
"copy those 2 entries from that other object" sometimes reaching 50 
levels deep where the same 2 (or more) entries require up to 50 object 
hops before they can be obtained.

Obviously here the encoding should be optimized to reduce the chaining 
effect simply by referring to the final object directly.  Hence tree A 
could be encoded as follows:

Tree A:
entry 0:"foo.txt"
entry 1-3:  copy from tree C: start=4 count=3

The on-disk encoding is likely to be the same size but the runtime 
access is greatly optimized.

Now... instead of trying to do reference simplification by factoring out 
the chain effect, I think a whole new approach to tree delta compression 
for pack v4 should be developed which would also address issue #1 above.

For example, we may try to make each canonical tree objects into 
sequences of hashed tree entries in memory where each tree entry would 
be reduced down to a single CRC32 value (or even Adler32 for speed).  
Each tree object would then be represented by some kind of 32-bit 
character "string".  Then it is just a matter of applying substring 
matching algorithms to find the best copy sequences without creating 
reference cycles, weighted by the number of indirections involved when 
a referred object already has a copy sequence cov

Re: Git pack v4: next step, help required

Git pack v4: next step, help required

2 matches

Site Navigation

Mail list logo

Footer information