On Wed, Apr 23, 2014 at 1:56 AM, Karsten Blees <karsten.bl...@gmail.com> wrote:
> On 22.04.2014 12:35, Duy Nguyen wrote:
>> On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <pclo...@gmail.com> wrote:
>>>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so
>>>> hopefully you did a dummy 'cache_name_exists("anything")' before starting
>>>> the measurement of the first run?
>>>
>>> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.
>>
>> Well name-hash is only used when core.ignorecase is set. So it's
>> optional.
>
> This is only true for the case-insensitive directory hash. The file hash
> ('cache_file_exists') is always used to skip expensive excluded checks for
> tracked files.
>
> 'cache_file_exists' basically trades faster lookups for higher setup costs,
> which makes perfect sense when scanning the entire work tree. However, if
> most of the directory info is cached and just a few directories need refresh
> (and core.ignorecase=false), binary search ('cache_name_pos') may be better.
> The difficulty is to decide when to choose one over the other :-)

Right. The problem is that even if the untracked cache is used, we don't
know in advance how many cache_file_exists calls we need to make. If
.gitignore changes, we could see how many directories are invalidated
recursively, and that could be an indicator for favoring
cache_file_exists over cache_name_pos. It's harder when a dir mtime
changes. I suppose we could be optimistic and stick to cache_name_pos
until the number of calls goes over a limit, then switch to
cache_file_exists. May backfire though..

>
>> Maybe we could save it in a separate index extension, but we
>> need to verify that the reader uses the same hash function as the
>> writer.
>>
>>>> Similarly, the '--directory' option controls early returns from the
>>>> directory scan (via read_directory_recursive's check_only argument), so
>>>> you won't be able to get a full untracked files listing if the cache was
>>>> recorded with '--directory'. Additionally, '--directory' aggregates the
>>>> state at the topmost untracked directory, so that directory's cached state
>>>> depends on all sub-directories as well...
>>>
>>> I missed this. We could ignore check_only if caching is enabled, but
>>> that does not sound really good. Let me think about it more..
>>
>> We could save "check_only" to the cache as well. This way we don't
>> have to disable the check_only trick completely.
>>
>> So we process a directory with check_only set, find one untracked
>> entry and stop short. We store check_only value and the status ("found
>> something") in addition to dir mtime. Next time we check the dir's
>> mtime. If it matches and is called with check_only set, we know there
>> is at least one untracked entry, that's enough to stop r_d_r and
>> return early. If dir mtime does not match, or r_d_r is called without
>> check_only, we ignore the cached data and fall back to opendir.
>>
>> Sounds good?
>>
>
> What about untracked files in sub-directories? E.g.
> you have untracked dirs
> a/b with untracked file a/b/c, so normal 'git status' would list 'a/' as
> untracked.
> Now, 'rm a/b/c' would update mtime of b, but not of a, so you'd still list
> 'a/' as untracked. Same thing for 'echo "c" >a/b/.gitignore'.
>
> Your solution could work if you additionally cache the directories that had
> to be scanned to find that first untracked file (but you probably had that in
> mind anyway).

Basically all directories that are touched by r_d_r() will be cached.

> If the cache is only used for certain dir_struct.flags combinations, you can
> probably get around saving the check_only flag (which can only ever be true
> if both DIR_SHOW_OTHER_DIRECTORIES and DIR_HIDE_EMPTY_DIRECTORIES are set
> (which is the default for 'git status')).
--
Duy
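
For illustration only, the validity check discussed in this thread (dir
mtime plus the saved check_only flag, extended to every directory the
original scan touched) could be sketched roughly as below. This is not
git's code: struct cached_dir, its fields and cache_usable() are all
hypothetical names, and current_mtime stands in for a fresh lstat() of
the directory so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical cache record for one directory; not git's actual layout. */
struct cached_dir {
	const char *path;
	time_t cached_mtime;         /* dir mtime when the entry was recorded */
	time_t current_mtime;        /* stand-in for a fresh lstat() result */
	int check_only;              /* was the recorded scan cut short? */
	int found_untracked;         /* status recorded by that scan */
	struct cached_dir **scanned; /* sub-directories that scan visited */
	size_t nr_scanned;
};

/*
 * Returns 1 if the cached result may be reused, 0 if the caller should
 * ignore it and fall back to opendir().
 */
static int cache_usable(const struct cached_dir *d, int check_only_now)
{
	size_t i;

	assert(d != NULL);
	if (d->cached_mtime != d->current_mtime)
		return 0; /* the directory itself changed */
	if (d->check_only && !check_only_now)
		return 0; /* a partial result cannot serve a full listing */
	for (i = 0; i < d->nr_scanned; i++)
		if (!cache_usable(d->scanned[i], check_only_now))
			return 0; /* e.g. 'rm a/b/c' changes only b's mtime */
	return 1;
}
```

With a record for 'a' that lists 'a/b' as scanned, removing a/b/c changes
only b's mtime, yet cache_usable() for 'a' returns 0, so the stale "a/ is
untracked" answer would not be reused.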