The use case is:

    tar -xzf bigproject.tar.gz
    cd bigproject
    git init
    git add .
    # git grep or something

The first "git add" generates a lot of loose objects. With --bulk, all
of them are forced into a single pack instead, which means less clutter
on disk and possibly faster object access.
On the gdb-7.5.1 source directory, the loose .git directory takes 66M
according to `du` while the packed one takes 32M. Timing of
"git grep --cached":

          loose      packed
    real  0m1.671s   0m1.372s
    user  0m1.542s   0m1.313s
    sys   0m0.126s   0m0.056s

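For anyone who wants to reproduce the size comparison, here is a rough
sketch that summarizes the output of `git count-objects -v` instead of
relying on `du`. The helper name summarize_objects is made up for this
example, and the numbers will not match `du` exactly because `du` also
counts the index and other files under .git.

```shell
# Hypothetical helper: reads `git count-objects -v` output on stdin and
# prints loose vs packed object counts and sizes (reported in KiB).
summarize_objects() {
    awk -F': ' '
        $1 == "count"     { loose_n = $2 }     # number of loose objects
        $1 == "size"      { loose_kib = $2 }   # disk use of loose objects
        $1 == "in-pack"   { packed_n = $2 }    # number of packed objects
        $1 == "size-pack" { packed_kib = $2 }  # disk use of packs
        END {
            printf "loose: %d objects, %d KiB; packed: %d objects, %d KiB\n",
                   loose_n, loose_kib, packed_n, packed_kib
        }
    '
}
```

Usage, run inside the repository after "git add ." with and without
--bulk: `git count-objects -v | summarize_objects`.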
It's not an all-win situation though. --bulk is slower than --no-bulk
because:
- Triple hashing: we need to calculate both object SHA-1s _and_ pack
SHA-1. At the end we have to fix up the pack, which means rehashing
the entire pack again. --no-bulk only cares about object SHA-1s.
- We write duplicate objects to the pack then truncate it, because we
don't know if it's a duplicate until we're done writing, and cannot
keep it in core because it's potentially big. So extra I/O (but
hopefully not too much because duplicate objects should not happen
often).
- Sort and write .idx file.
- (For the future) --no-bulk could benefit from multithreading
because this is a CPU-bound operation. --bulk could not.
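To make the "object SHA-1" part of the first point concrete, here is a
small sketch of how a blob's object name is derived: SHA-1 over the
header "blob <size>\0" followed by the file content. The function name
git_blob_sha1 is hypothetical, and the sketch assumes a `sha1sum`
binary (GNU coreutils) is available. The pack checksum is a separate
hash over the pack bytes, which is why --bulk pays for hashing more
than once.

```shell
# Hypothetical helper: print the blob object name git would assign to
# the file given as $1, without touching any repository.
git_blob_sha1() {
    # Content size in bytes; tr strips padding some wc versions add.
    size=$(wc -c < "$1" | tr -d ' ')
    # Header "blob <size>" + NUL byte, then the raw content.
    { printf 'blob %s\000' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

The result should agree with `git hash-object <file>` for files that
are not subject to any conversion (CRLF, filters).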
But again, this comparison is not fair: --bulk is closer to

    git add . &&
    git ls-files --stage | awk '{print $2;}' | \
        git pack-objects .git/objects/pack/pack

except that it does not deltify or sort objects.

Signed-off-by: Nguyễn Thái Ngọc Duy <[email protected]>
---
v2 examines the pros and cons of --bulk, and I'm no longer sure that
turning it on automatically (with heuristics) is a good idea.
It also fixes empty files not being packed.
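To illustrate the empty-file fix, here is a minimal shell sketch of
only the size part of the index_fd() condition touched in sha1_file.c.
The names old_rule and new_rule are made up for this example, and the
`type != OBJ_BLOB` and would_convert_to_git() clauses are deliberately
left out. "loose" stands for the index_core() branch, "pack" for the
other branch. With a threshold of 0, the old test `size <= threshold`
is still true for a zero-byte file, so presumably that is why empty
files stayed loose.

```shell
# Hypothetical helpers; arguments are: <size> <big_file_threshold>.

# Size test before this patch: size <= big_file_threshold.
old_rule() {
    if [ "$1" -le "$2" ]; then echo loose; else echo pack; fi
}

# Size test after this patch: a threshold of 0 never selects "loose".
new_rule() {
    if [ "$2" -ne 0 ] && [ "$1" -le "$2" ]; then echo loose; else echo pack; fi
}

old_rule 0 0    # prints "loose": the empty file misses the pack
new_rule 0 0    # prints "pack"
```

A non-zero threshold behaves the same under both rules, so the change
only matters for the 0 case that --bulk sets.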
Documentation/git-add.txt | 10 ++++++++++
builtin/add.c | 10 +++++++++-
sha1_file.c | 3 ++-
3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index 48754cb..147d191 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -160,6 +160,16 @@ today's "git add <pathspec>...", ignoring removed files.
be ignored, no matter if they are already present in the work
tree or not.
+--bulk::
+ Normally new objects are indexed and stored in loose format,
+ one file per new object in "$GIT_DIR/objects". This option
+ forces putting all objects into a single new pack. This may
+ be useful when you need to add a lot of files initially.
++
+This option is equivalent to running `git -c core.bigFileThreshold=0 add`.
+If you want to pack only files larger than a size threshold, use the
+long form with a non-zero threshold.
+
\--::
This option can be used to separate command-line options from
the list of files, (useful when filenames might be mistaken
diff --git a/builtin/add.c b/builtin/add.c
index 226f758..40cbb71 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -336,7 +336,7 @@ static struct lock_file lock_file;
static const char ignore_error[] =
N_("The following paths are ignored by one of your .gitignore files:\n");
-static int verbose, show_only, ignored_too, refresh_only;
+static int verbose, show_only, ignored_too, refresh_only, bulk_index;
static int ignore_add_errors, intent_to_add, ignore_missing;
#define ADDREMOVE_DEFAULT 0 /* Change to 1 in Git 2.0 */
@@ -368,6 +368,7 @@ static struct option builtin_add_options[] = {
OPT_BOOL( 0 , "refresh", &refresh_only, N_("don't add, only refresh the index")),
OPT_BOOL( 0 , "ignore-errors", &ignore_add_errors, N_("just skip files which cannot be added because of errors")),
OPT_BOOL( 0 , "ignore-missing", &ignore_missing, N_("check if - even missing - files are ignored in dry run")),
+ OPT_BOOL( 0 , "bulk", &bulk_index, N_("pack all objects instead of creating loose ones")),
OPT_END(),
};
@@ -560,6 +561,13 @@ int cmd_add(int argc, const char **argv, const char *prefix)
free(seen);
}
+ if (bulk_index)
+ /*
+ * Pretend all blobs are "large" files, forcing them
+ * all into a pack
+ */
+ big_file_threshold = 0;
+
plug_bulk_checkin();
if ((flags & ADD_CACHE_IMPLICIT_DOT) && prefix) {
diff --git a/sha1_file.c b/sha1_file.c
index f80bbe4..8b66840 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3137,7 +3137,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
if (!S_ISREG(st->st_mode))
ret = index_pipe(sha1, fd, type, path, flags);
- else if (size <= big_file_threshold || type != OBJ_BLOB ||
+ else if ((big_file_threshold && size <= big_file_threshold) ||
+ type != OBJ_BLOB ||
(path && would_convert_to_git(path, NULL, 0, 0)))
ret = index_core(sha1, fd, size, type, path, flags);
else
--
1.8.2.82.gc24b958