On Wed, Jun 26, 2013 at 7:45 PM, Jeff King <p...@peff.net> wrote:
>
> In particular, it seems like the slowness we saw with the v1 bitmap
> format is not what Shawn and Colby have experienced. So it's possible
> that our test setup is bad or different. Or maybe the C v1 reading
> implementation had some problems that are fixable. It's hard to say
> because we haven't shown any code that can be timed and compared.

Right, the format and implementation in JGit can do "Counting objects"
in 87ms for the Linux kernel history. But I think we are comparing
apples to steaks here, Vincent is (rightfully) concerned about process
startup performance, whereas our timings were assuming the process was
already running.

It would help everyone to understand the issues involved if we are at
least looking at the same file format.

> And the pack-order versus idx-order for the bitmaps is still up in the
> air. Do we have numbers on the on-disk sizes of the resulting EWAHs?

I did not see any presented in this thread, and I am very interested
in this aspect of the series. The path hash cache should be taking
about 9.9M of disk space, but I recall reading the bitmap file is 8M.
I don't understand.

Colby and I were very concerned about the size of the EWAH compressed
bitmaps because we wanted hundreds of them for a large history like
the kernel, and we wanted to minimize the amount of memory consumed by
the bitmap index when loaded into the process.

In the JGit implementation our copy of Linus' tree has 3.1M objects,
an 81.5 MiB idx file, and a 3.8 MiB bitmap file. We were trying to
keep the overhead below 10% of the idx file, and I think we have
succeeded on that. With 3.1M objects the v2 bitmap proposed in this
thread needs at least 11.8M, or 14+% overhead just for the path hash
cache.

The path hash cache may still be required, Colby and I have been
debating the merits of having the data available for delta compression
vs. the increase in memory required to hold it.

> The
> pack-order ones should be more amenable to run-length encoding,
> especially as you get further down into history (the tip ones would
> mostly be 1's, no matter how you order them).

This is also true for the type bitmaps, but especially so in the JGit
file ordering where we always write all trees before any blobs. The
type bitmaps are very compact and basically amount to defining a
single range in the file. This takes only a few words in the EWAH
compressed format.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to