Re: [RFC] Speed up "git status" by caching untracked file info

Duy Nguyen Tue, 22 Apr 2014 17:54:07 -0700

On Wed, Apr 23, 2014 at 1:56 AM, Karsten Blees <[email protected]> wrote:
> Am 22.04.2014 12:35, schrieb Duy Nguyen:
>> On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <[email protected]> wrote:
>>>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so 
>>>> hopefully you did a dummy 'cache_name_exists("anything")' before starting 
>>>> the measurement of the first run?
>>>
>>> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.
>>
>> Well name-hash is only used when core.ignorecase is set. So it's
>> optional.
>
> This is only true for the case-insensitive directory hash. The file hash 
> ('cache_file_exists') is always used to skip expensive excluded checks for 
> tracked files.
>
> 'cache_file_exists' basically trades higher setup costs for faster lookups, 
> which makes perfect sense when scanning the entire work tree. However, if 
> most of the directory info is cached and just a few directories need refresh 
> (and core.ignorecase=false), binary search ('cache_name_pos') may be better. 
> The difficulty is to decide when to choose one over the other :-)


Right. The problem is that even if the untracked cache is used, we don't
know in advance how many cache_file_exists calls we need to make. If
.gitignore changes, we could count how many directories are invalidated
recursively and use that as an indicator for favoring cache_file_exists
over cache_name_pos. It's harder when dir mtime changes. I suppose we
could be optimistic and stick to cache_name_pos until the number of calls
goes over a limit, then switch to cache_file_exists. May backfire though..
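
To make the "optimistic" idea concrete, here is a rough, self-contained
sketch. It is not git code: toy_index, toy_file_exists and LOOKUP_LIMIT
are made-up names, the binary search only stands in for cache_name_pos,
the lazily built table only stands in for the name hash behind
cache_file_exists, and the threshold value is an arbitrary assumption
that would need measuring.

```c
#include <stdlib.h>
#include <string.h>

#define LOOKUP_LIMIT 4	/* assumed threshold, would need tuning */

/* Hypothetical stand-in for the index; not git's real structures. */
struct toy_index {
	const char **names;	/* sorted by strcmp() */
	size_t nr;
	size_t lookups;		/* existence checks made so far */
	const char **table;	/* open-addressing table, built lazily */
	size_t table_size;
};

static size_t toy_hash(const char *s)
{
	size_t h = 5381;
	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h;
}

static int cmp_name(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/* No setup cost, O(log n) per call (the cache_name_pos side). */
static int exists_bsearch(const struct toy_index *idx, const char *name)
{
	return bsearch(name, idx->names, idx->nr,
		       sizeof(*idx->names), cmp_name) != NULL;
}

/* One-time setup cost (the lazy_init_name_hash side). */
static void build_table(struct toy_index *idx)
{
	size_t i;

	idx->table_size = idx->nr * 2 + 1;	/* always leaves empty slots */
	idx->table = calloc(idx->table_size, sizeof(*idx->table));
	if (!idx->table)
		return;		/* keep using binary search */
	for (i = 0; i < idx->nr; i++) {
		size_t slot = toy_hash(idx->names[i]) % idx->table_size;
		while (idx->table[slot])
			slot = (slot + 1) % idx->table_size;
		idx->table[slot] = idx->names[i];
	}
}

static int exists_hash(const struct toy_index *idx, const char *name)
{
	size_t slot = toy_hash(name) % idx->table_size;

	while (idx->table[slot]) {
		if (!strcmp(idx->table[slot], name))
			return 1;
		slot = (slot + 1) % idx->table_size;
	}
	return 0;
}

/* Stay on binary search until the call count passes the limit. */
static int toy_file_exists(struct toy_index *idx, const char *name)
{
	if (!idx->table && ++idx->lookups > LOOKUP_LIMIT)
		build_table(idx);
	return idx->table ? exists_hash(idx, name)
			  : exists_bsearch(idx, name);
}
```

The backfire case is visible here too: if the scan ends just after
crossing the limit, we pay for the binary searches and for building the
table, and get almost nothing back.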

>
>> Maybe we could save it in a separate index extension, but we
>> need to verify that the reader uses the same hash function as the
>> writer.
>>
>>>> Similarly, the '--directory' option controls early returns from the 
>>>> directory scan (via read_directory_recursive's check_only argument), so 
>>>> you won't be able to get a full untracked files listing if the cache was 
>>>> recorded with '--directory'. Additionally, '--directory' aggregates the 
>>>> state at the topmost untracked directory, so that directory's cached state 
>>>> depends on all sub-directories as well...
>>>
>>> I missed this. We could ignore check_only if caching is enabled, but
>>> that does not sound really good. Let me think about it more..
>>
>> We could save "check_only" to the cache as well. This way we don't
>> have to disable the check_only trick completely.
>>
>> So we process a directory with check_only set, find one untracked
>> entry and stop short. We store check_only value and the status ("found
>> something") in addition to dir mtime. Next time we check the dir's
>> mtime. If it matches and is called with check_only set, we know there
>> is at least one untracked entry, that's enough to stop r_d_r and
>> return early. If dir mtime does not match, or r_d_r is called without
>> check_only, we ignore the cached data and fall back to opendir.
>>
>> Sounds good?
>>
>
> What about untracked files in sub-directories? E.g. you have untracked dirs 
> a/b with untracked file a/b/c, so normal 'git status' would list 'a/' as 
> untracked.
> Now, 'rm a/b/c' would update mtime of b, but not of a, so you'd still list 
> 'a/' as untracked. Same thing for 'echo "c" >a/b/.gitignore'.
>
> Your solution could work if you additionally cache the directories that had 
> to be scanned to find that first untracked file (but you probably had that in 
> mind anyway).

Basically all directories that are touched by r_d_r() will be cached.
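
Something along these lines, as a rough sketch only (not actual git
code). toy_dir_cache, toy_scan and toy_cache_usable are made-up names,
and the current_mtime array is a stand-in for calling stat() on each
cached directory. Letting a full (non-check_only) cached scan answer a
later check_only query is my assumption, not something settled above.

```c
#include <stddef.h>

/* Hypothetical cache record for one directory visited by r_d_r(). */
struct toy_dir_cache {
	long mtime;		/* directory mtime when the entry was recorded */
	int check_only;		/* was the scan cut short at the first hit? */
	int found_untracked;	/* result of that scan */
};

/*
 * All directories touched while answering a query about the top
 * directory are stored together, top directory first.
 */
struct toy_scan {
	struct toy_dir_cache *dirs;
	size_t nr;
};

/*
 * Return 1 if the cached answer may be reused, 0 if the caller has to
 * fall back to opendir(). current_mtime[i] is what stat() would report
 * now for dirs[i].
 */
static int toy_cache_usable(const struct toy_scan *scan,
			    const long *current_mtime, int check_only)
{
	size_t i;

	for (i = 0; i < scan->nr; i++) {
		const struct toy_dir_cache *d = &scan->dirs[i];

		/* Any touched directory changing invalidates the answer. */
		if (d->mtime != current_mtime[i])
			return 0;
		/* A short-circuited scan cannot serve a full listing. */
		if (d->check_only && !check_only)
			return 0;
	}
	return scan->nr > 0;
}
```

With entries for a/ and a/b/, 'rm a/b/c' changes the mtime of b only,
but since b is part of the cached set for a/, the stale "a/ is
untracked" answer gets thrown away.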

> If the cache is only used for certain dir_struct.flags combinations, you can 
> probably get around saving the check_only flag (which can only ever be true 
> if both DIR_SHOW_OTHER_DIRECTORIES and DIR_HIDE_EMPTY_DIRECTORIES are set 
> (which is the default for 'git status')).
-- 
Duy