Re: [PATCH 8/7] cat-file: sort and de-dup output of --batch-all-objects

2015-06-22 Thread Charles Bailey
On Mon, Jun 22, 2015 at 07:06:32AM -0400, Jeff King wrote:
 On Mon, Jun 22, 2015 at 06:33:21AM -0400, Jeff King wrote:
 
  By the way, in addition to not showing objects in order,
  list-all-objects (and my cat-file option) may show duplicates. Do we
  want to sort -u for the user? It might be nice for them to always get
  a de-duped and sorted list. Aside from the CPU cost of sorting, it does
  mean we'll allocate ~80MB for the kernel to store the sha1s. I guess
  that's not too much when you are talking about the kernel repo. I took
  the coward's way out and just mentioned the limitation in the
  documentation, but I'm happy to be persuaded.
 
 The patch below does the sort/de-dup. I'd probably just squash it into
 patch 7, though.

Woah, 8 out of 7! Did you get a chance to measure the performance hit of
the sort? If not, I may test it out when I next get the chance.
--
To unsubscribe from this list: send the line unsubscribe git in


Re: [PATCH 8/7] cat-file: sort and de-dup output of --batch-all-objects

2015-06-22 Thread Jeff King
On Mon, Jun 22, 2015 at 11:03:50PM +0100, Charles Bailey wrote:

  The patch below does the sort/de-dup. I'd probably just squash it into
  patch 7, though.
 
 Woah, 8 out of 7! Did you get a chance to measure the performance hit of
 the sort? If not, I may test it out when I next get the chance.

No, that last patch was my eh, one more thing before bed patch. ;)

It's easy enough to time, though. Running:

  git cat-file --batch-all-objects \
   --batch-check='%(objectsize) %(objectname)' \
   --buffer /dev/null

on linux.git, my best-of-five goes from (no sorting):

  real0m3.604s
  user0m3.556s
  sys 0m0.048s

to (with sorting):

  real0m4.053s
  user0m4.004s
  sys 0m0.052s

So it does matter, but not too much. We could de-dup with a hash table,
which might be a little faster, but I doubt it would make much
difference.  It's also mostly in sorted order already; it's possible
that a merge sort would behave a little better. I'm not sure how deep
it's worth going into that rabbit hole.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in


[PATCH 8/7] cat-file: sort and de-dup output of --batch-all-objects

2015-06-22 Thread Jeff King
On Mon, Jun 22, 2015 at 06:33:21AM -0400, Jeff King wrote:

 By the way, in addition to not showing objects in order,
 list-all-objects (and my cat-file option) may show duplicates. Do we
 want to sort -u for the user? It might be nice for them to always get
 a de-duped and sorted list. Aside from the CPU cost of sorting, it does
 mean we'll allocate ~80MB for the kernel to store the sha1s. I guess
 that's not too much when you are talking about the kernel repo. I took
 the coward's way out and just mentioned the limitation in the
 documentation, but I'm happy to be persuaded.

The patch below does the sort/de-dup. I'd probably just squash it into
patch 7, though.

I did have one additional thought, though. We are treating this as two
separate operations: what are the sha1s in the repo and show me
information about this sha1. But by integrating with cat-file, we could
actually show information not just about a particular sha1, but about a
particular on-disk object.

E.g., if there are duplicates of a particular object, some formatters
like %(objectsize:disk) and %(deltabase) pick one arbitrarily to
show. I don't know if anybody actually cares about that in practice, but
if we show duplicates, we could give the accurate information for each
instance (and in fact we could give other information like loose vs
packed, which file contains the object, etc).

I tend to think that the lack of de-duping is sufficiently confusing
that it should be the default, and we can always add a no really, show
me the duplicates option later. It is not as simple as skipping the
de-dup step. We'd have to actually avoid calling sha1_object_info, and
use the information found in the loose/pack traversal (which would in
turn require exposing the low-level bits of sha1_object_info).

-- 8 --
Subject: cat-file: sort and de-dup output of --batch-all-objects

The sorting we could probably live without, but printing
duplicates is just a hassle for the user, who must then
de-dup themselves (or risk a wrong answer if they are doing
something like counting objects with a particular property).

Signed-off-by: Jeff King p...@peff.net
---
 Documentation/git-cat-file.txt |  3 +--
 builtin/cat-file.c | 22 +++---
 t/t1006-cat-file.sh|  3 +--
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
index 6831b08..3105fc0 100644
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -74,8 +74,7 @@ OPTIONS
requested batch operation on all objects in the repository and
any alternate object stores (not just reachable objects).
Requires `--batch` or `--batch-check` be specified. Note that
-   the order of the objects is unspecified, and there may be
-   duplicate entries.
+   the objects are visited in order sorted by their hashes.
 
 --buffer::
Normally batch output is flushed after each object is output, so
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 95604c4..07baad1 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -9,6 +9,7 @@
 #include userdiff.h
 #include streaming.h
 #include tree-walk.h
+#include sha1-array.h
 
 struct batch_options {
int enabled;
@@ -324,19 +325,19 @@ struct object_cb_data {
struct expand_data *expand;
 };
 
-static int batch_object_cb(const unsigned char *sha1,
-  struct object_cb_data *data)
+static void batch_object_cb(const unsigned char sha1[20], void *vdata)
 {
+   struct object_cb_data *data = vdata;
hashcpy(data-expand-sha1, sha1);
batch_object_write(NULL, data-opt, data-expand);
-   return 0;
 }
 
 static int batch_loose_object(const unsigned char *sha1,
  const char *path,
  void *data)
 {
-   return batch_object_cb(sha1, data);
+   sha1_array_append(data, sha1);
+   return 0;
 }
 
 static int batch_packed_object(const unsigned char *sha1,
@@ -344,7 +345,8 @@ static int batch_packed_object(const unsigned char *sha1,
   uint32_t pos,
   void *data)
 {
-   return batch_object_cb(sha1, data);
+   sha1_array_append(data, sha1);
+   return 0;
 }
 
 static int batch_objects(struct batch_options *opt)
@@ -375,11 +377,17 @@ static int batch_objects(struct batch_options *opt)
data.info.typep = data.type;
 
if (opt-all_objects) {
+   struct sha1_array sa = SHA1_ARRAY_INIT;
struct object_cb_data cb;
+
+   for_each_loose_object(batch_loose_object, sa, 0);
+   for_each_packed_object(batch_packed_object, sa, 0);
+
cb.opt = opt;
cb.expand = data;
-   for_each_loose_object(batch_loose_object, cb, 0);
-   for_each_packed_object(batch_packed_object, cb, 0);
+   sha1_array_for_each_unique(sa,