Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 5/1/2018 8:36 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, May 01 2018, Derrick Stolee wrote:


How would sorting in our custom order before de-duplicating fail the
de-duplication? We will still pair identical OIDs as consecutive
elements and oid_array_for_each_unique only cares about consecutive
elements having distinct OIDs, not lex-ordered OIDs.

Because there's no de-duplication without the array first being sorted
in oidcmp() order, which oid_array_for_each_unique() checks for and
re-sorts if !array->sorted. I.e. its de-duplication is just a state
machine where it won't call the callback if the currently processed
element has the same SHA1 as the last one.


Perhaps the noise is because we rely on oid_array_sort() to mark the
array as sorted inside oid_array_for_each_unique(), but that could be
remedied by calling our QSORT() inside for_each_abbrev() and marking
the array as sorted before calling oid_array_for_each_unique().

As noted above this won't work, because the function inherently relies
on the array being sorted to be able to de-duplicate. Doing this will
yield duplicate entries.


I'm confused as to why my suggestion doesn't work, so I made it 
concrete. I sent an alternate commit 6/12 to your v2 series [1].


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180501130318.58251-1-dsto...@microsoft.com/T/#u


Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Ævar Arnfjörð Bjarmason

On Tue, May 01 2018, Derrick Stolee wrote:

> On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, May 01 2018, Derrick Stolee wrote:
>>
>>> On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:
 Since we show the commit data in the output that's nicely aligned once
 we sort by object type. The decision to show tags before commits is
 pretty arbitrary, but it's much less likely that we'll display a tag,
 so if there is one it makes sense to show it first.
>>> Here's a non-arbitrary reason: the object types are ordered
>>> topologically (ignoring self-references):
>>>
>>> tag -> commit, tree, blob
>>> commit -> tree
>>> tree -> blob
>> Thanks. I'll add a patch with that comment to v2.
>>
 @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, 
 struct object_id *oid,
ds.fn = NULL;
advise(_("The candidates are:"));
 -  for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
 +  for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect);
 +  QSORT(collect.oid, collect.nr, sort_ambiguous);
>>> I was wondering how the old code sorted by SHA even when the ambiguous
>>> objects were loaded from different sources (multiple pack-files, loose
>>> objects). Turns out that for_each_abbrev() does its own sort after
>>> collecting the SHAs and then calls the given function pointer only
>>> once per distinct object. This avoids multiple instances of the same
>>> object, which may appear multiple times across pack-files.
>>>
>>> I only ask because now we are doing two sorts. I wonder if it would be
>>> more elegant to provide your sorting algorithm to for_each_abbrev()
>>> and let it call show_ambiguous_object as before.
>>>
>>> Another question is if we should use this sort generally for all calls
>>> to for_each_abbrev(). The only other case I see is in
>>> builtin/revparse.c.
>> When preparing v2 I realized how confusing this was, so I'd added this
>> to the commit message of my WIP re-roll which should explain this:
>>
>>  A note on the implementation: I started out with something much
>>  simpler which just replaced oid_array_sort() in sha1-array.c with a
>>  custom sort function before calling oid_array_for_each_unique(). But
>>  then dumbly noticed that it doesn't work because the output function
>>  was tangled up with the code added in fad6b9e590 ("for_each_abbrev:
>>  drop duplicate objects", 2016-09-26) to ensure we don't display
>>  duplicate objects.
>>   That's why we're doing two passes here, first we need to
>> sort the list
>>  and de-duplicate the objects, then sort them in our custom order, and
>>  finally output them without re-sorting them. I suppose we could also
>>  make oid_array_for_each_unique() maintain a hashmap of emitted
>>  objects, but that would increase its memory profile and wouldn't be
>>  worth the complexity for this one-off use-case,
>>  oid_array_for_each_unique() is used in many other places.
>
> How would sorting in our custom order before de-duplicating fail the
> de-duplication? We will still pair identical OIDs as consecutive
> elements and oid_array_for_each_unique only cares about consecutive
> elements having distinct OIDs, not lex-ordered OIDs.

Because there's no de-duplication without the array first being sorted
in oidcmp() order, which oid_array_for_each_unique() checks for and
re-sorts if !array->sorted. I.e. its de-duplication is just a state
machine where it won't call the callback if the currently processed
element has the same SHA1 as the last one.

> Perhaps the noise is because we rely on oid_array_sort() to mark the
> array as sorted inside oid_array_for_each_unique(), but that could be
> remedied by calling our QSORT() inside for_each_abbrev() and marking
> the array as sorted before calling oid_array_for_each_unique().

As noted above this won't work, because the function inherently relies
on the array being sorted to be able to de-duplicate. Doing this will
yield duplicate entries.


Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, May 01 2018, Derrick Stolee wrote:


On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.

Here's a non-arbitrary reason: the object types are ordered
topologically (ignoring self-references):

tag -> commit, tree, blob
commit -> tree
tree -> blob

Thanks. I'll add a patch with that comment to v2.


@@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct 
object_id *oid,
ds.fn = NULL;
advise(_("The candidates are:"));
-   for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
+   for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect);
+   QSORT(collect.oid, collect.nr, sort_ambiguous);

I was wondering how the old code sorted by SHA even when the ambiguous
objects were loaded from different sources (multiple pack-files, loose
objects). Turns out that for_each_abbrev() does its own sort after
collecting the SHAs and then calls the given function pointer only
once per distinct object. This avoids multiple instances of the same
object, which may appear multiple times across pack-files.

I only ask because now we are doing two sorts. I wonder if it would be
more elegant to provide your sorting algorithm to for_each_abbrev()
and let it call show_ambiguous_object as before.

Another question is if we should use this sort generally for all calls
to for_each_abbrev(). The only other case I see is in
builtin/revparse.c.

When preparing v2 I realized how confusing this was, so I'd added this
to the commit message of my WIP re-roll which should explain this:

 A note on the implementation: I started out with something much
 simpler which just replaced oid_array_sort() in sha1-array.c with a
 custom sort function before calling oid_array_for_each_unique(). But
 then dumbly noticed that it doesn't work because the output function
 was tangled up with the code added in fad6b9e590 ("for_each_abbrev:
 drop duplicate objects", 2016-09-26) to ensure we don't display
 duplicate objects.
 
 That's why we're doing two passes here, first we need to sort the list

 and de-duplicate the objects, then sort them in our custom order, and
 finally output them without re-sorting them. I suppose we could also
 make oid_array_for_each_unique() maintain a hashmap of emitted
 objects, but that would increase its memory profile and wouldn't be
 worth the complexity for this one-off use-case,
 oid_array_for_each_unique() is used in many other places.


How would sorting in our custom order before de-duplicating fail the 
de-duplication? We will still pair identical OIDs as consecutive 
elements and oid_array_for_each_unique only cares about consecutive 
elements having distinct OIDs, not lex-ordered OIDs.


Perhaps the noise is because we rely on oid_array_sort() to mark the 
array as sorted inside oid_array_for_each_unique(), but that could be 
remedied by calling our QSORT() inside for_each_abbrev() and marking the 
array as sorted before calling oid_array_for_each_unique().


(Again, my comments are not meant to block this series.)

Thanks,
-Stolee


Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Ævar Arnfjörð Bjarmason

On Tue, May 01 2018, Derrick Stolee wrote:

> On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:
>> Since we show the commit data in the output that's nicely aligned once
>> we sort by object type. The decision to show tags before commits is
>> pretty arbitrary, but it's much less likely that we'll display a tag,
>> so if there is one it makes sense to show it first.
>
> Here's a non-arbitrary reason: the object types are ordered
> topologically (ignoring self-references):
>
> tag -> commit, tree, blob
> commit -> tree
> tree -> blob

Thanks. I'll add a patch with that comment to v2.

>> @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, 
>> struct object_id *oid,
>>  ds.fn = NULL;
>>  advise(_("The candidates are:"));
>> -for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
>> +for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect);
>> +QSORT(collect.oid, collect.nr, sort_ambiguous);
>
> I was wondering how the old code sorted by SHA even when the ambiguous
> objects were loaded from different sources (multiple pack-files, loose
> objects). Turns out that for_each_abbrev() does its own sort after
> collecting the SHAs and then calls the given function pointer only
> once per distinct object. This avoids multiple instances of the same
> object, which may appear multiple times across pack-files.
>
> I only ask because now we are doing two sorts. I wonder if it would be
> more elegant to provide your sorting algorithm to for_each_abbrev()
> and let it call show_ambiguous_object as before.
>
> Another question is if we should use this sort generally for all calls
> to for_each_abbrev(). The only other case I see is in
> builtin/revparse.c.

When preparing v2 I realized how confusing this was, so I'd added this
to the commit message of my WIP re-roll which should explain this:

A note on the implementation: I started out with something much
simpler which just replaced oid_array_sort() in sha1-array.c with a
custom sort function before calling oid_array_for_each_unique(). But
then dumbly noticed that it doesn't work because the output function
was tangled up with the code added in fad6b9e590 ("for_each_abbrev:
drop duplicate objects", 2016-09-26) to ensure we don't display
duplicate objects.

That's why we're doing two passes here, first we need to sort the list
and de-duplicate the objects, then sort them in our custom order, and
finally output them without re-sorting them. I suppose we could also
make oid_array_for_each_unique() maintain a hashmap of emitted
objects, but that would increase its memory profile and wouldn't be
worth the complexity for this one-off use-case,
oid_array_for_each_unique() is used in many other places.


Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

Change the output emitted when an ambiguous object is encountered so
that we show tags first, then commits, followed by trees, and finally
blobs. Within each type we show objects in hashcmp(). Before this
change the objects were only ordered by hashcmp().

The reason for doing this is that the output looks better as a result,
e.g. the v2.17.0 tag before this change on "git show e8f2" would
display:

 hint: The candidates are:
 hint:   e8f2093055 tree
 hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
 hint:   e8f21d02f7 blob
 hint:   e8f21d577c blob
 hint:   e8f25a3a50 tree
 hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
 hint:   e8f2650052 tag v2.17.0
 hint:   e8f2867228 blob
 hint:   e8f28d537c tree
 hint:   e8f2a35526 blob
 hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
 hint:   e8f2cf6ec0 tree

Now we'll instead show:

 hint:   e8f2650052 tag v2.17.0
 hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
 hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
 hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
 hint:   e8f2093055 tree
 hint:   e8f25a3a50 tree
 hint:   e8f28d537c tree
 hint:   e8f2cf6ec0 tree
 hint:   e8f21d02f7 blob
 hint:   e8f21d577c blob
 hint:   e8f2867228 blob
 hint:   e8f2a35526 blob

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.


Here's a non-arbitrary reason: the object types are ordered 
topologically (ignoring self-references):


tag -> commit, tree, blob
commit -> tree
tree -> blob


Signed-off-by: Ævar Arnfjörð Bjarmason 
---
  sha1-array.c | 15 +++
  sha1-array.h |  3 +++
  sha1-name.c  | 37 -
  3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf847..48bd9e9230 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array)
array->sorted = 0;
  }
  
+

+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data)
+{
+   int i;
+
+   for (i = 0; i < array->nr; i++) {
+   int ret = fn(array->oid + i, data);
+   if (ret)
+   return ret;
+   }
+   return 0;
+}
+
  int oid_array_for_each_unique(struct oid_array *array,
for_each_oid_fn fn,
void *data)
diff --git a/sha1-array.h b/sha1-array.h
index 1e1d24b009..232bf95017 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array);
  
  typedef int (*for_each_oid_fn)(const struct object_id *oid,

   void *data);
+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data);
  int oid_array_for_each_unique(struct oid_array *array,
  for_each_oid_fn fn,
  void *data);
diff --git a/sha1-name.c b/sha1-name.c
index 9d7bbd3e96..46d8b1afa6 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, 
void *data)
return 0;
  }
  
+static int sort_ambiguous(const void *a, const void *b)

+{
+   int a_type = oid_object_info(a, NULL);
+   int b_type = oid_object_info(b, NULL);
+   int a_type_sort;
+   int b_type_sort;
+
+   /*
+* Sorts by hash within the same object type, just as
+* oid_array_for_each_unique() would do.
+*/
+   if (a_type == b_type)
+   return oidcmp(a, b);
+
+   /*
+* Between object types show tags, then commits, and finally
+* trees and blobs.
+*
+* The object_type enum is commit, tree, blob, tag, but we
+* want tag, commit, tree blob. Cleverly (perhaps too
+* cleverly) do that with modulus, since the enum assigns 1 to
+* commit, so tag becomes 0.
+*/


I appreciate this comment. Clever things should be marked as such.


+   a_type_sort = a_type % 4;
+   b_type_sort = b_type % 4;
+   return a_type_sort > b_type_sort ? 1 : -1;
+}
+
  static int get_short_oid(const char *name, int len, struct object_id *oid,
  unsigned flags)
  {
@@ -409,6 +437,8 @@ static int get_sho

[PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-04-30 Thread Ævar Arnfjörð Bjarmason
Change the output emitted when an ambiguous object is encountered so
that we show tags first, then commits, followed by trees, and finally
blobs. Within each type we show objects in hashcmp(). Before this
change the objects were only ordered by hashcmp().

The reason for doing this is that the output looks better as a result,
e.g. the v2.17.0 tag before this change on "git show e8f2" would
display:

hint: The candidates are:
hint:   e8f2093055 tree
hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
hint:   e8f21d02f7 blob
hint:   e8f21d577c blob
hint:   e8f25a3a50 tree
hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
hint:   e8f2650052 tag v2.17.0
hint:   e8f2867228 blob
hint:   e8f28d537c tree
hint:   e8f2a35526 blob
hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
hint:   e8f2cf6ec0 tree

Now we'll instead show:

hint:   e8f2650052 tag v2.17.0
hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
hint:   e8f2093055 tree
hint:   e8f25a3a50 tree
hint:   e8f28d537c tree
hint:   e8f2cf6ec0 tree
hint:   e8f21d02f7 blob
hint:   e8f21d577c blob
hint:   e8f2867228 blob
hint:   e8f2a35526 blob

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 sha1-array.c | 15 +++
 sha1-array.h |  3 +++
 sha1-name.c  | 37 -
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf847..48bd9e9230 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array)
array->sorted = 0;
 }
 
+
+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data)
+{
+   int i;
+
+   for (i = 0; i < array->nr; i++) {
+   int ret = fn(array->oid + i, data);
+   if (ret)
+   return ret;
+   }
+   return 0;
+}
+
 int oid_array_for_each_unique(struct oid_array *array,
for_each_oid_fn fn,
void *data)
diff --git a/sha1-array.h b/sha1-array.h
index 1e1d24b009..232bf95017 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array);
 
 typedef int (*for_each_oid_fn)(const struct object_id *oid,
   void *data);
+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data);
 int oid_array_for_each_unique(struct oid_array *array,
  for_each_oid_fn fn,
  void *data);
diff --git a/sha1-name.c b/sha1-name.c
index 9d7bbd3e96..46d8b1afa6 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, 
void *data)
return 0;
 }
 
+static int sort_ambiguous(const void *a, const void *b)
+{
+   int a_type = oid_object_info(a, NULL);
+   int b_type = oid_object_info(b, NULL);
+   int a_type_sort;
+   int b_type_sort;
+
+   /*
+* Sorts by hash within the same object type, just as
+* oid_array_for_each_unique() would do.
+*/
+   if (a_type == b_type)
+   return oidcmp(a, b);
+
+   /*
+* Between object types show tags, then commits, and finally
+* trees and blobs.
+*
+* The object_type enum is commit, tree, blob, tag, but we
+* want tag, commit, tree blob. Cleverly (perhaps too
+* cleverly) do that with modulus, since the enum assigns 1 to
+* commit, so tag becomes 0.
+*/
+   a_type_sort = a_type % 4;
+   b_type_sort = b_type % 4;
+   return a_type_sort > b_type_sort ? 1 : -1;
+}
+
 static int get_short_oid(const char *name, int len, struct object_id *oid,
  unsigned flags)
 {
@@ -409,6 +437,8 @@ static int get_short_oid(const char *name, int len, struct 
object_id *oid,
status = finish_object_disambiguation(&ds, oid);
 
if (!quietly && (status == SHORT_NAME_AMBIGUOUS)) {
+   struct oid_array collect = OID_ARRAY_INIT;
+
error(_("short SHA1 %s is ambiguous"), ds.hex_pfx);
 
/*
@@