Re: [RFC] Speed up "git status" by caching untracked file info

Karsten Blees Tue, 22 Apr 2014 02:57:14 -0700

On 17.04.2014 07:51, Nguyễn Thái Ngọc Duy wrote:
> This patch serves as a heads up about a feature I'm working on. I hope
> that by posting it early, people could double check if I have made
> some fundamental mistakes that completely ruin the idea. It's about
> speeding up "git status" by caching untracked file info in the index
> _if_ your file system supports it (more below).
> 
> The whole WIP series is at
> 
> https://github.com/pclouds/git/commits/untracked-cache
> 
> I only post the real meat here. I'm aware of a few incomplete details
> in this patch, but nothing fundamentally wrong. So far the numbers are
> promising.  ls-files is updated to run fill_directory() twice in a
> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
> (with gcc -O0) gives me:
> 
>            first run  second (cached) run
> gentoo-x86    500 ms             71.6  ms
> wine          140 ms              9.72 ms
> webkit        125 ms              6.88 ms


IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully 
you did a dummy 'cache_name_exists("anything")' before starting the measurement 
of the first run?
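
To illustrate the concern, here is a rough sketch of a warmed-up measurement. All names (sketch_lazy_init, sketch_lookup, sketch_time_ms) are made up for this example and are not git functions; sketch_lookup merely stands in for something like cache_name_exists() that triggers a one-time lazy initialization.

```c
#include <time.h>

/* Hypothetical stand-ins; none of these are real git functions. */
static int sketch_init_calls;   /* how often the expensive init ran */
static int sketch_table_ready;  /* set once the lazy init is done */

/* One-time setup, comparable in spirit to a lazily built name hash. */
static void sketch_lazy_init(void)
{
	if (sketch_table_ready)
		return;
	sketch_init_calls++;    /* the expensive work would happen here */
	sketch_table_ready = 1;
}

/* Dummy lookup: triggers the lazy init, then reports "not found". */
static int sketch_lookup(const char *name)
{
	sketch_lazy_init();
	(void)name;
	return 0;
}

/* Time a single call in milliseconds of CPU time, using clock(). */
static double sketch_time_ms(int (*fn)(const char *), const char *arg)
{
	clock_t t0 = clock();
	fn(arg);
	clock_t t1 = clock();
	return (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling sketch_lookup("anything") once before the first sketch_time_ms() keeps the one-time setup cost out of the "first run" number, so both runs are compared on equal footing.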

> linux-2.6     106 ms             16.2  ms
> 
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows..
> 
> 
> read_directory() plays a big part in the slowness of "git status"
> because it has to read every directory and check for excluded entries,
> which is really expensive. This patch adds an option to cache the
> results so that after the first slow read_directory(), the following
> calls should be cheap and fast.
> 
> The following inputs are sufficient to determine what files in a
> directory are excluded:
> 
>  - The list of files and directories of the directory in question
>  - The $GIT_DIR/index
>  - The content of $GIT_DIR/info/exclude
>  - The content of core.excludesfile
>  - The content (or the lack) of .gitignore of all parent directories
>    from $GIT_WORK_TREE
> 

The dir_struct.flags also play a big role in what read_directory returns.

E.g. it seems untracked files are not properly recorded if the cache is filled 
with the '--ignored' option:

> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>                       break;
>  
>               case path_untracked:
> -                     if (!(dir->flags & DIR_SHOW_IGNORED))
> -                             dir_add_name(dir, path.buf, path.len);
> +                     if (dir->flags & DIR_SHOW_IGNORED)
> +                             break;
> +                     dir_add_name(dir, path.buf, path.len);
> +                     if (cdir.fdir)
> +                             add_untracked(untracked, path.buf + baselen);
>                       break;

Similarly, the '--directory' option controls early returns from the directory 
scan (via read_directory_recursive's check_only argument), so you won't be able 
to get a full untracked files listing if the cache was recorded with 
'--directory'. Additionally, '--directory' aggregates the state at the topmost 
untracked directory, so that directory's cached state depends on all 
sub-directories as well...

I wonder if it makes sense to separate cache recording logic from 
read_directory_recursive and friends, which are mainly concerned with flags 
processing.
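
One possible way to guard against this, sketched under my own assumptions and not taken from the patch: remember the flags that were in effect when a directory's entry was filled, and refuse to reuse the entry when the relevant flags differ. The struct, the function names and the SKETCH_* bits below are hypothetical; the real flag definitions live in dir.h.

```c
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
#define SKETCH_SHOW_IGNORED     (1u << 0) /* stands in for '--ignored' */
#define SKETCH_SHOW_DIRECTORIES (1u << 1) /* stands in for '--directory' */

/* Hypothetical per-directory cache record. */
struct sketch_dir_cache {
	unsigned recorded_flags; /* flags in effect when the entry was filled */
	int valid;               /* non-zero once the entry has been filled */
	size_t nr_untracked;     /* number of untracked names recorded */
};

/* Remember which flags produced the recorded listing. */
static void sketch_record(struct sketch_dir_cache *c, unsigned flags, size_t nr)
{
	c->recorded_flags = flags;
	c->nr_untracked = nr;
	c->valid = 1;
}

/* Reuse the entry only if the flags that shape the listing are unchanged. */
static int sketch_usable(const struct sketch_dir_cache *c, unsigned flags)
{
	unsigned mask = SKETCH_SHOW_IGNORED | SKETCH_SHOW_DIRECTORIES;

	return c->valid && ((c->recorded_flags ^ flags) & mask) == 0;
}
```

This only avoids serving a listing recorded under different options. It does not give you a cache that is reusable across options, which is what separating the recording logic would aim for.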

> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
> 
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
> 
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*). If you do not use such a file system, this
> feature is not for you.
> 
> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.
> 
> The remaining inputs are easy: their SHA-1 could be used to verify
> their contents. We do need to read .gitignore files and digest
> them. But they are usually few and small, so the overhead should not
> be much.
> 
> At the implementation level, the whole directory structure is saved,
> each directory corresponds to one struct untracked_dir.

With the usual options (e.g. standard 'git status'), untracked directories are 
mostly skipped, so the cache would mostly store tracked directories. Naming it 
'struct untracked_dir' is a bit confusing, IMO.

> Each directory
> holds SHA-1 of the .gitignore underneath (or null if it does not
> exist) and the list of untracked "files" and subdirs that need to be
> recursed into if all is well. Untracked subdirectories are saved in the
> file queue and are the reason for quoting "files" in the previous
> sentence.
> 
> On the first run, no untracked_dir is valid, so the default code path is
> run. prep_exclude() is updated to record SHA-1 of .gitignore along the
> way. read_directory_recursive() is updated to record untracked files.
> 
> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed. With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.
> 
> So if all is really well, read_directory() becomes a series of
> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
> stat(<dir>) _without_ heavyweight exclude filtering. There should be
> no overhead if this feature is disabled.
> 

Wouldn't mtime of .gitignore files suffice here (so you don't need to open and 
parse them every time)?
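
Roughly what I have in mind, as a sketch only: keep a small stat "stamp" per .gitignore and compare it on later runs, so the file is re-read only when the stamp differs. The struct and function names are invented for this example, and comparing mtime plus size is my assumption about what would be enough. A file rewritten with the same size within the timestamp granularity would go unnoticed, so some racy-timestamp handling would still be needed.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>

/* Hypothetical record of what was last seen for one .gitignore file. */
struct sketch_stamp {
	int exists;   /* 0 if the file was missing */
	time_t mtime; /* last modification time */
	off_t size;   /* size in bytes */
};

/* Fill 'out' from the file at 'path'; a missing file yields exists == 0. */
static void sketch_stamp_path(const char *path, struct sketch_stamp *out)
{
	struct stat st;

	memset(out, 0, sizeof(*out));
	if (stat(path, &st) == 0) {
		out->exists = 1;
		out->mtime = st.st_mtime;
		out->size = st.st_size;
	}
}

/* Non-zero if both stamps describe the same (or the same missing) file state. */
static int sketch_stamp_equal(const struct sketch_stamp *a,
			      const struct sketch_stamp *b)
{
	if (a->exists != b->exists)
		return 0;
	if (!a->exists)
		return 1;
	return a->mtime == b->mtime && a->size == b->size;
}
```

With something like this, the fast path costs one stat() per .gitignore instead of open/read/close plus hashing.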


