On 17.04.2014 07:51, Nguyễn Thái Ngọc Duy wrote:
> This patch serves as a heads up about a feature I'm working on. I hope
> that by posting it early, people could double check if I have made
> some fundamental mistakes that completely ruin the idea. It's about
> speeding up "git status" by caching untracked file info in the index
> _if_ your file system supports it (more below).
> The whole WIP series is at
> I only post the real meat here. I'm aware of a few incomplete details
> in this patch, but nothing fundamentally wrong. So far the numbers are
> promising. ls-files is updated to run fill_directory() twice in a
> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
> (with gcc -O0) gives me:
>                first run    second (cached) run
> gentoo-x86     500 ms       71.6 ms
> wine           140 ms       9.72 ms
> webkit         125 ms       6.88 ms
IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully
you did a dummy 'cache_name_exists("anything")' before starting the measurement
of the first run?
> linux-2.6      106 ms       16.2 ms
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows..
> read_directory() plays a big part in the slowness of "git status"
> because it has to read every directory and check for excluded entries,
> which is really expensive. This patch adds an option to cache the
> results so that after the first slow read_directory(), the following
> calls should be cheap and fast.
> The following inputs are sufficient to determine what files in a
> directory are excluded:
> - The list of files and directories of the directory in question
> - The $GIT_DIR/index
> - The content of $GIT_DIR/info/exclude
> - The content of core.excludesfile
> - The content (or the lack) of .gitignore of all parent directories
> from $GIT_WORK_TREE
The dir_struct.flags also play a big role in evaluation of read_directory.
E.g. it seems untracked files are not properly recorded if the cache is filled
with '--ignored' option:
> @@ -1360,15 +1603,18 @@ static enum path_treatment
> read_directory_recursive(struct dir_struct *dir,
> case path_untracked:
> - if (!(dir->flags & DIR_SHOW_IGNORED))
> - dir_add_name(dir, path.buf, path.len);
> + if (dir->flags & DIR_SHOW_IGNORED)
> + break;
> + dir_add_name(dir, path.buf, path.len);
> + if (cdir.fdir)
> + add_untracked(untracked, path.buf + baselen);
Similarly, the '--directory' option controls early returns from the directory
scan (via read_directory_recursive's check_only argument), so you won't be able
to get a full untracked files listing if the cache was recorded with
'--directory'. Additionally, '--directory' aggregates the state at the topmost
untracked directory, so that directory's cached state depends on all
sub-directories as well...
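
To illustrate the concern, here is a minimal sketch of one conservative option: remember the flags that were in effect when the cache was filled, and refuse to reuse the cache for a scan with different flags. All names are hypothetical, and the flag constants are local stand-ins rather than the real dir_struct values; this is not code from the patch.

```c
#include <assert.h>

/* Stand-ins for dir_struct flags; the values are illustrative only. */
#define SKETCH_SHOW_IGNORED    (1u << 0)  /* think '--ignored' */
#define SKETCH_SHOW_OTHER_DIRS (1u << 1)  /* think '--directory' */

/* Hypothetical cache header: remembers which flags produced the data. */
struct sketch_untracked_cache {
	unsigned recorded_flags;
	int populated;
};

/* Record the flags in effect at the time the cache is filled. */
static void sketch_cache_fill(struct sketch_untracked_cache *uc,
			      unsigned flags)
{
	uc->recorded_flags = flags;
	uc->populated = 1;
}

/* The cached listing is only reusable for a scan with identical flags. */
static int sketch_cache_usable(const struct sketch_untracked_cache *uc,
			       unsigned flags)
{
	return uc->populated && uc->recorded_flags == flags;
}
```

Requiring an exact match is deliberately strict. A cache recorded without '--directory' could in principle serve a '--directory' query, but not the other way around, so a real implementation might want something finer-grained.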
I wonder if it makes sense to separate cache recording logic from
read_directory_recursive and friends, which are mainly concerned with flags
> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*). If you do not use such a file system, this
> feature is not for you.
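
For reference, the directory mtime check described above could look roughly like the sketch below. The struct and function names are made up for illustration and are not the patch's struct untracked_dir. It compares whole-second st_mtime only, which is portable but racy for changes made within the same second as the recording, so real code would presumably need extra care there.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical per-directory record taken after a full scan. */
struct sketch_dir_stamp {
	time_t mtime;
	dev_t dev;
	ino_t ino;
	int valid;
};

/* Remember the directory's identity and mtime. Returns 0 on success. */
static int sketch_dir_record(const char *path, struct sketch_dir_stamp *out)
{
	struct stat st;

	memset(out, 0, sizeof(*out));
	if (stat(path, &st) < 0 || !S_ISDIR(st.st_mode))
		return -1;
	out->mtime = st.st_mtime;
	out->dev = st.st_dev;
	out->ino = st.st_ino;
	out->valid = 1;
	return 0;
}

/* Returns 1 if the directory looks unchanged since it was recorded. */
static int sketch_dir_unchanged(const char *path,
				const struct sketch_dir_stamp *saved)
{
	struct stat st;

	if (!saved->valid || stat(path, &st) < 0 || !S_ISDIR(st.st_mode))
		return 0;
	return st.st_mtime == saved->mtime &&
	       st.st_dev == saved->dev &&
	       st.st_ino == saved->ino;
}
```

Note that this only covers entries added to or removed from the directory itself, not anything happening in its subdirectories.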
> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.
> The remaining inputs are easy, their SHA-1 could be used to verify
> their contents. We do need to read .gitignore files and digest
> them. But they are usually few and small, so the overhead should not
> be much.
> At the implementation level, the whole directory structure is saved,
> each directory corresponds to one struct untracked_dir.
With the usual options (e.g. standard 'git status'), untracked directories are
mostly skipped, so the cache would mostly store tracked directories. Naming it
'struct untracked_dir' is a bit confusing, IMO.
> Each directory
> holds SHA-1 of the .gitignore underneath (or null if it does not
> exist) and the list of untracked "files" and subdirs that need to
> recurse into if all is well. Untracked subdirectories are saved in the
> file queue and are the reason for quoting "files" in the previous sentence.
> On the first run, no untracked_dir is valid, the default code path is
> run. prep_exclude() is updated to record SHA-1 of .gitignore along the
> way. read_directory_recursive() is updated to record untracked files.
> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed. With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.
> So if all is really well, read_directory() becomes a series of
> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
> stat(<dir>) _without_ heavyweight exclude filtering. There should be
> no overhead if this feature is disabled.
Wouldn't mtime of .gitignore files suffice here (so you don't need to open and
parse them every time)?
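
Something along these lines is what I have in mind: keep a small stat record per .gitignore and only re-read and re-hash the file when that record no longer matches. This is a rough sketch with hypothetical names, assuming that stat data (mtime, size, inode) is a good enough change indicator; a missing file is treated as a valid, cacheable state.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical stat record for one .gitignore (or its absence). */
struct sketch_ignore_stamp {
	int exists;
	time_t mtime;
	off_t size;
	ino_t ino;
};

/* Fill 'out' from the file at 'path'. Returns 0 on success, -1 on error. */
static int sketch_ignore_record(const char *path,
				struct sketch_ignore_stamp *out)
{
	struct stat st;

	memset(out, 0, sizeof(*out));
	if (stat(path, &st) < 0)
		return errno == ENOENT ? 0 : -1; /* absence is cacheable */
	out->exists = 1;
	out->mtime = st.st_mtime;
	out->size = st.st_size;
	out->ino = st.st_ino;
	return 0;
}

/* Returns 1 if the file must be re-read and re-hashed, 0 if not. */
static int sketch_ignore_needs_rehash(const char *path,
				      const struct sketch_ignore_stamp *saved)
{
	struct sketch_ignore_stamp now;

	if (sketch_ignore_record(path, &now) < 0)
		return 1; /* be conservative on unexpected errors */
	if (now.exists != saved->exists)
		return 1;
	if (!now.exists)
		return 0;
	return now.mtime != saved->mtime ||
	       now.size != saved->size ||
	       now.ino != saved->ino;
}
```

The SHA-1 would then only be needed as a fallback when the stat data differs, to tell whether the content really changed.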