Re: Store refreshed stat info in a separate file?

2014-04-24 Thread Duy Nguyen
On Sat, Apr 19, 2014 at 12:43 AM, Junio C Hamano gits...@pobox.com wrote:
> Having said that, I do not think there is a fundamental reason why
> the stat data has to live inside the same index file.  A separate
> file is just fine, as long as you can reliably detect that they went
> out of sync for whatever reason (e.g. the index proper updated, a
> stale stat file left behind), and storing the trailer checksum from
> the corresponding index in this new file is an obvious and good
> solution.

I've gone further and now store index updates (including entry removals
and additions) in the second index file, so that index I/O cost is
proportional to the number of changed entries, not the work tree size
(sort of). This scales much better when the work tree is huge. There
is one flaw though, and I'm expecting many "yuck" responses from
people. So let's try to settle it now, or drop the idea.

The idea is that we can support another mode, where index content is
stored in two files: the small $GIT_DIR/index and the large
$GIT_DIR/index.base. index contains the changes that should be applied
to index.base. Whenever you do something to the index, index records
those actions. Git reads both index.base and index, then replays the
actions to build the final index in memory. index.base contains the
full worktree data and remains unchanged until index becomes so
big/slow that its changes should be merged back into index.base. This
works well (my prototype passed the test suite), and even better than
index v5, because v5 still rewrites the whole index file when an entry
is added or removed.
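
As a rough illustration of the replay step described above, here is a
minimal Python sketch. The data layout is an assumption made for the
example only: the base is modelled as a mapping from path to entry, and
the small index as a list of (action, path, entry) tuples. The name
replay and the action names "add" and "remove" are hypothetical and do
not come from the prototype.

```python
# Hypothetical model, not the prototype's on-disk format:
# base maps path -> entry data, actions is the recorded change log.

def replay(base, actions):
    """Return the final in-memory index built from base plus actions."""
    index = dict(base)  # copy, so the base itself stays unchanged
    for op, path, entry in actions:
        if op == "add":
            # Adds a new entry or replaces an existing one.
            index[path] = entry
        elif op == "remove":
            # Drops an entry; removing an absent path is a no-op.
            index.pop(path, None)
        else:
            raise ValueError(f"unknown action: {op!r}")
    return index


base = {"Makefile": "sha-a", "README": "sha-b", "main.c": "sha-c"}
actions = [
    ("add", "new.c", "sha-d"),
    ("remove", "README", None),
    ("add", "main.c", "sha-e"),
]
final = replay(base, actions)
```

The cost of writing is then tied to the length of the action list,
while the base only needs rewriting when the list is merged back.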

But there is a problem with atomic updates. The good old rename() does
not work well with two files. This is not a problem for the C part; I
can still make atomic updates work there. Scripts, on the other hand,
may rely on mv or similar commands/functions to prepare a temp index
and move it to $GIT_DIR/index. The workaround is to merge the two files
back into a single index file so that scripts can mv $temp_index as
before, and pay the whole-index I/O penalty. An alternative is to store
the two files in one, so that the one index file actually consists of
two subfiles. We avoid the atomic update problem, but we pay the I/O
cost of writing 10MB every time the index is updated (though not the
cost of hashing a 10MB file) and introduce a new index format. This is
even yuckier in my opinion.
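
For reference, the single-file pattern that scripts rely on can be
sketched in Python as below. This is only an illustration of
write-to-temp-then-rename, not code from git; the function name
write_atomically and the temp file prefix are made up for the example.
The rename covers exactly one path, which is why it does not extend
naturally to a pair of files.

```python
import os
import tempfile


def write_atomically(path, data):
    """Replace path with data so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so that the final
    # rename stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-index-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # a single rename, atomic on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```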

Should I continue, or drop it?
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Store refreshed stat info in a separate file?

2014-04-18 Thread Duy Nguyen
With git status, writing the refreshed index takes 252ms of a total of
1s, 361ms of 1.4s, and 86ms of 360ms on gentoo-x86, webkit and linux-2.6
respectively (*). That is a significant share of the time git status
takes. And this happens whenever you touch a single tracked file, then
run git status. We tried to solve this with index v5, but it's been
years(?) since it started as a GSoC project. So I'm thinking of another
way around..

The major cost of writing an index is the SHA-1 hashing. The bigger
the written part is, the higher cost we pay. So what if we write
stat-only data to a separate file? Think of it as an index extension,
only it stays outside the index. On webkit with 182k files, the stat
data size would be about 6MB (its index v4 is 15M for comparison). But
with stat-only data we could employ some cheap but efficient compression:
sd_dev, sd_uid and sd_gid are likely the same for every entry. And we
could store the stat data of updated entries only. So I'm hoping to
get that 6MB down to a few hundred KBs. That makes hashing lightning
fast.
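
To make the size estimate and the shared-field idea concrete, here is a
small Python sketch. It assumes nine 32-bit fields per entry (36 bytes),
which for 182k entries gives roughly 6.5MB, in line with the "about
6MB" figure above. The field names and the packing layout are
illustrative assumptions, not an actual format; the sketch simply
stores dev, uid and gid once when they are identical for all entries.

```python
import struct

# Assumed per-entry fields, each stored as an unsigned 32-bit integer.
FIELDS = ("ctime_sec", "ctime_nsec", "mtime_sec", "mtime_nsec",
          "dev", "ino", "uid", "gid", "size")
SHARED = ("dev", "uid", "gid")
PER_ENTRY = tuple(f for f in FIELDS if f not in SHARED)


def pack_full(entries):
    """Pack every field of every entry: 36 bytes per entry."""
    return b"".join(struct.pack(">9I", *(e[f] for f in FIELDS))
                    for e in entries)


def pack_shared(entries):
    """Store dev/uid/gid once in a header, then 24 bytes per entry."""
    first = entries[0]
    if any(e[f] != first[f] for e in entries for f in SHARED):
        raise ValueError("shared fields differ; use pack_full instead")
    header = struct.pack(">3I", *(first[f] for f in SHARED))
    body = b"".join(struct.pack(">6I", *(e[f] for f in PER_ENTRY))
                    for e in entries)
    return header + body


def unpack_shared(blob):
    """Inverse of pack_shared."""
    shared = dict(zip(SHARED, struct.unpack_from(">3I", blob, 0)))
    entries = []
    for off in range(12, len(blob), 24):
        e = dict(zip(PER_ENTRY, struct.unpack_from(">6I", blob, off)))
        e.update(shared)
        entries.append(e)
    return entries


# Synthetic entries, for illustration only.
entries = [
    {"ctime_sec": 1397800000 + i, "ctime_nsec": 0,
     "mtime_sec": 1397800000 + i, "mtime_nsec": 0,
     "dev": 2049, "ino": 1000 + i, "uid": 1000, "gid": 1000,
     "size": 512 * i}
    for i in range(4)
]
full = pack_full(entries)
compact = pack_shared(entries)
```

Factoring out the shared fields saves a third of the per-entry size on
its own; writing only the updated entries is what would bring the total
down much further.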

So the idea is, when we do a refresh, we note which entries have had
their stat data updated. Then we write $GIT_DIR/index.stat (and leave
$GIT_DIR/index alone), which is a valid index except that it has zero
entries and only one (new) extension storing the (maybe compressed)
stat data of the updated entries. The extension also contains the
trailing SHA-1 of $GIT_DIR/index for verification later. When we read
$GIT_DIR/index, we check for the existence of index.stat. If it exists
and its attached SHA-1 matches, we overwrite some stat data with the
info from index.stat.
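
The read-side check described above might look roughly like the
following Python sketch. It relies on the fact that an index file ends
with a 20-byte SHA-1 of its preceding content. Everything else is an
assumption for illustration: the index is modelled as raw bytes plus a
path-to-stat mapping, index.stat as a (recorded SHA-1, overrides) pair,
and the function names are hypothetical.

```python
import hashlib

SHA1_LEN = 20


def make_index(content):
    """Toy index: content followed by the SHA-1 of that content."""
    return content + hashlib.sha1(content).digest()


def trailing_checksum(index_bytes):
    """Return the trailing SHA-1 stored at the end of the index."""
    return index_bytes[-SHA1_LEN:]


def apply_stat_overrides(index_bytes, stats, stat_file):
    """Merge overrides from index.stat only if it matches this index.

    stats:     path -> stat data as read from the index proper
    stat_file: None if index.stat does not exist, otherwise a tuple
               (recorded_sha1, overrides) as assumed for this sketch
    """
    if stat_file is None:
        return dict(stats)
    recorded_sha1, overrides = stat_file
    if recorded_sha1 != trailing_checksum(index_bytes):
        # index.stat belongs to a different index: ignore it as stale.
        return dict(stats)
    merged = dict(stats)
    for path, st in overrides.items():
        if path in merged:  # only overwrite entries the index knows
            merged[path] = st
    return merged


index_bytes = make_index(b"toy index content")
stats = {"a.c": ("mtime", 100), "b.c": ("mtime", 200)}
fresh = (trailing_checksum(index_bytes), {"b.c": ("mtime", 250)})
stale = (hashlib.sha1(b"some older index").digest(),
         {"b.c": ("mtime", 999)})
```

With these inputs, the fresh file updates b.c while the stale one
leaves the stat data from the index untouched.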

Back to the original question: I'm hoping to reduce the write times
above to less than 10ms with this. So far I see only good points and
no bad ones. Time to ask git@vger for some. I'm actually
trying this idea in my untracked cache because I can't afford to lose
50% of the gain from untracked cache, just because I have to save some
bits in the giant $GIT_DIR/index and take the cost of rehashing.

(*) this is with the untracked cache enabled and total time is about
40% less than upstream git status. The numbers against upstream git
status are actually less significant. But I have to think positive
that one day untracked cache may be merged :)
-- 
Duy


Re: Store refreshed stat info in a separate file?

2014-04-18 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

> The major cost of writing an index is the SHA-1 hashing. The bigger
> the written part is, the higher cost we pay. So what if we write
> stat-only data to a separate file? Think of it as an index extension,
> only it stays outside the index. On webkit with 182k files, the stat
> data size would be about 6MB (its index v4 is 15M for comparison). But
> with stat-only data we could employ some cheap but efficient compression:
> sd_dev, sd_uid and sd_gid are likely the same for every entry. And we
> could store the stat data of updated entries only. So I'm hoping to
> get that 6MB down to a few hundred KBs. That makes hashing lightning
> fast.

It is perfectly OK to store your verbose stat data after deflating
it in the index as an index extension, so "storing 6MB that can be
compressed efficiently without compressing is dumb" applies whether
the result is stored in the index or in a separate file, I would
think.

Having said that, I do not think there is a fundamental reason why
the stat data has to live inside the same index file.  A separate
file is just fine, as long as you can reliably detect that they went
out of sync for whatever reason (e.g. the index proper updated, a
stale stat file left behind), and storing the trailer checksum from
the corresponding index in this new file is an obvious and good
solution.

I am not sure if that should be called index.stat, though.  It is
more about untracked files.  The stat data for cached paths are in
the index proper, so what you are adding is not what we would call
stat info when we talk about the index.