On Wed, Feb 13 2019, Nguyễn Thái Ngọc Duy wrote:

> Index file size more or less translates to write time because we hash
> the entire file every time we update the index. And we update the index
> quite often (automatically index refresh is done everywhere). This means
> smaller index files are faster, especially true for very large
> worktrees.
>
> Index v4 attempts to reduce file size by "prefix compressing"
> paths. This reduces file size from 17% (git.git) to 41% (webkit.git,
> deep hierarchy).
>
> Index v5 takes the same idea to the next level. Instead of compressing
> just paths, based on the previous entry, we "compress" a lot more
> fields.
>
> Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the
> same most of the time. ctime should often be the same (or differs just
> slightly). And sometimes mtime is the same as well. st_ino is also
> always zero on Windows. We're storing a lot of duplicate values.
>
> Index v5 handles this

This looks really promising.

>  - by adding a "same mask" per entry. If st_dev is the same as previous
>    entry, for instance, we set "st_dev is the same" flag and will not
>    store it at all, saving 31 bits per entry.
>
>  - even when we store it, "varint" encoding is used. We should rarely
>    need to write out 4 bytes
>
>  - for ctime and mtime, even if we have to store it, we store the offset
>    instead of absolute numbers. This often leads to smaller numbers,
>    which also means fewer bytes to encode.

Sounds good. I wonder if you've thought about/considered a couple of
optimizations on top of this, or if they're possible. Both share the
same theme:

* Instead of adding a "same as last mask" adding "same as Nth
  mask". Something similar exists in the Sereal format (which also has
  other techniques you use, e.g. varint
  https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod#the-copy-tag)

  So instead of:

      <mask1><same><mask2><same><mask1><same> etc.

   You'd have:

      <mask1 (mark1)><same><mask2 (mark2)><same><insert: mark1><same> etc.

   I.e. when you have data that flip-flops a lot you can save space by
   saying "it's the same as existing earlier value at offset N". Maybe
   it doesn't make sense for this data, I don't know...

* For ctime/mtime presumably for dir paths, are these paths tolerant to
  or already out of glob() order? Then perhaps they can be pre-sorted so
  the compression or ctime/mtime offset compression is more effective.

> As a result of this, v5 reduces file size from 30% (git.git) to
> 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file
> size is reduced by 63%! A 8.4MB index file is _almost_ acceptable.
>
> Of course we trade off storage with cpu. We now need to spend more
> cycles writing or even reading (but still plenty fast compared to
> zlib). For reading, I'm counting on multi thread to hide away all this
> even if it becomes significant.

This would be a bigger change, but have we/you ever done a POC
experiment to see how much of this time is eaten up by zlib that
wouldn't be eaten up with some of the newer "fast but good enough"
compression algorithms, e.g. Snappy and Zstandard?

Reply via email to