Re: Git pack v4: next step, help required
On Thu, Sep 19, 2013 at 12:40 AM, Nicolas Pitre wrote: > > I think the pack v4 format and compatibility code is pretty stable now, > and probably as optimal as it can reasonably be. > > @Junio: might be a good idea to refresh your pu branch again. > > Now the biggest problem to solve is the actual tree > "deltification". I don't have the time to spend on this otherwise very > interesting problem (IMHO at least) so I'm sending this request for help > in the hope that more people would be keen to contribute their computing > skills to solve this challenge. Just so we're clear as I'm involved a bit in packv4 series. I plan to make the test suite fully pass, then add v4 support to git protocol. After that I may look into adding multi-base tree support to index-pack/unpack-objects, and only then either look into this challenge or continue to add v4-aware tree walker to git. In short, you you are intereted in Nico's challenge, go ahead, it will not overlap with my work, at least in the next one or two months. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Git pack v4: next step, help required
I think the pack v4 format and compatibility code is pretty stable now, and probably as optimal as it can reasonably be. @Junio: might be a good idea to refresh your pu branch again. Now the biggest problem to solve is the actual tree "deltification". I don't have the time to spend on this otherwise very interesting problem (IMHO at least) so I'm sending this request for help in the hope that more people would be keen to contribute their computing skills to solve this challenge. I'll make a quick description of the pv4 tree encoding first and explain the problem to solve afterwards. A pv4 tree is comprised of two types of records: immediate tree entry and tree sequence copy. 1) Immediate Tree Entry This is made of the following tuple: ,{SHA1 reference index> The path component index refers to the path dictionary table where path strings and file modes are uniquely stored. The SHA1 index refers to the sorted SHA1 table as previously found in the pack index but which is now part of the pack file itself. So on average a single tree entry may take between 1 to 2 bytes for the path component index and 1 to 3 bytes for the SHA1 index. 2) Tree Sequence Copy This is used to literally copy a contiguous list of tree entries from another tree object. This goes as follows: , [,] So instead of having arbitrary copying of data like in delta objects, we refer directly to tree entries in another object. The big advantage here is that the tree walker may directly jump to the copied object without having to do any delta patching and caching. The SHA1 index is optional if it refers to the same copied object as the previous tree sequence copy record in this tree object. And another possible optimization for the tree walker when enumerating objects is to skip the copy entry entirely if the object being copied from has already been enumerated. The size of this entry is more variable and is typically between 1 to 2 bytes for the start index, 1 to 2 bytes for the copy size, and 0 to 3 bytes for the SHA1 index. So... what to do with this? Currently the "deltification" is achieved using a pretty dumb heuristic which is to simply take each tree object in a pack v2 with their corresponding base delta object and perform a straight conversion into pack v4 tree format i.e. use copy records whenever possible as long as they represent a space saving over the corresponding immediate tree entries (which is normally the case). However this is rather sub-optimal for two reasons: 1) This doesn't benefit from the ability to use multiple base objects to copy from and is potentially missing on additional opportunities for copy sequences which are not possible in the selected base object from the pack v2 delta base selection. Pack v4 is already quite smaller than pack v2 yet it might possibly be smaller still. 2) This makes deep delta chains very inefficient at runtime when the pack is subsequently accessed. Let's consider this example to fully illustrate #2. Tree A: entry 0:"foo.txt" entry 1-3: copy from tree B: start=2 count=3 Tree B: entry 0-5: copy from tree C: start=2 count=5 entry 6:"bar.txt" Tree C: entry 0:"file_a" entry 1:"file_b" entry 2:"file_c" entry 3:"file_D" entry 4:"file_E" entry 5:"file_F" entry 6:"file_G" This is a good example of what typically happens when the above heuristic is applied. And it is not uncommon to see a long indirection chain of "copy those 2 entries from that other object" sometimes reaching 50 levels deep where the same 2 (or more) entries require up to 50 object hops before they can be obtained. Obviously here the encoding should be optimized to reduce the chaining effect simply by referring to the final object directly. Hence tree A could be encoded as follows: Tree A: entry 0:"foo.txt" entry 1-3: copy from tree C: start=4 count=3 The on-disk encoding is likely to be the same size but the runtime access is greatly optimized. Now... instead of trying to do reference simplification by factoring out the chain effect, I think a whole new approach to tree delta compression for pack v4 should be developed which would also address issue #1 above. For example, we may try to make each canonical tree objects into sequences of hashed tree entries in memory where each tree entry would be reduced down to a single CRC32 value (or even Adler32 for speed). Each tree object would then be represented by some kind of 32-bit character "string". Then it is just a matter of applying substring matching algorithms to find the best copy sequences without creating reference cycles, weighted by the number of indirections involved when a referred object already has a copy sequence cov