On Thu, Oct 3, 2013 at 1:43 PM, Junio C Hamano <gits...@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy <pclo...@gmail.com> writes:
>> The use case is
>>
>>   tar -xzf bigproject.tar.gz
>>   cd bigproject
>>   git init
>>   git add .
>>   # git grep or something
> Two obvious thoughts, and a half.
> (1) This particular invocation of "git add" can easily detect that
> it is run in a repository with no $GIT_INDEX_FILE yet, which is
> the most typical case for a big initial import. It could even
> ask if the current branch is unborn if you wanted to make the
> heuristic more specific to this use case. Perhaps it would
> make sense to automatically plug the bulk import machinery in
> such a case without an option?
Yeah! I did not even think of that.
> (2) Imagine performing a dry-run of update_files_in_cache() using a
> different diff-files callback that is similar to the
> update_callback() but that uses the lstat(2) data to see how
> big an import this really is, instead of calling
> add_file_to_index(), before actually registering the data to
> the object database. If you benchmark to see how expensive it
> is, you may find that such a scheme might be a workable
> auto-tuning mechanism to trigger this. Even if it were
> moderately expensive, when combined with the heuristics above
> for (1), it might be a worthwhile thing to do only when it is
> likely to be an initial import.
We do a lot of lstat calls nowadays when refreshing the index, so it's
likely reasonably cheap, but I doubt people do mass updates on
existing files often. Adding a large number of new files (even when
.git/index exists) may be a better indication of an import, and we
already have that information from fill_directory().
For the no-.git/index case, packing with bulk-checkin probably
produces a reasonably good pack, because the files don't delta well
anyway: there are no previous versions to delta against. They can
delta against other files, but I don't think we'll get good
compression from that. For the case where .git/index exists, we may
interfere with "git gc --auto": we create a less optimal pack, but
it's still a pack, and it may push the next gc further out.
> (3) Is it always a good idea to send everything to a packfile on a
> large addition, or are you often better off importing the
> initial fileset as loose objects? If the latter, then the
> option name "--bulk" may give users a wrong hint "if you are
> doing a bulk-import, you are better off using this option".
Hard question. Fewer files are definitely a good thing, for example
when you "rm -rf" the whole thing :-) Disk usage is another. On
gdb-7.3.1, "du -sh" reports .git with loose objects takes 57M, while
the packed one takes 29M. Disk access is slightly faster on packed
.git, at least for "git grep --cached": 0.71s vs 0.83s.
> This is a very logical extension to what was started at 568508e7
> (bulk-checkin: replace fast-import based implementation,
> 2011-10-28), and I like it. I suspect "--bulk=<threshold>" might
> be a better alternative than setting the threshold unconditionally
> to zero, though.
The threshold may be better in the form of a config setting, because I
wouldn't want to specify it every time. But does one really want to
keep some small files around in loose format? I don't see it, because
my goal is to keep a clean .git, but maybe the loose format has some
advantages I'm missing.