Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 8:36 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, May 01 2018, Derrick Stolee wrote: How would sorting in our custom order before de-duplicating fail the de-duplication? We will still pair identical OIDs as consecutive elements and oid_array_for_each_unique only cares about consecutive elements having distinct OIDs, not lex-ordered OIDs. Because there's no de-duplication without the array first being sorted in oidcmp() order, which oid_array_for_each_unique() checks for and re-sorts if !array->sorted. I.e. its de-duplication is just a state machine where it won't call the callback if the currently processed element has the same SHA1 as the last one. Perhaps the noise is because we rely on oid_array_sort() to mark the array as sorted inside oid_array_for_each_unique(), but that could be remedied by calling our QSORT() inside for_each_abbrev() and marking the array as sorted before calling oid_array_for_each_unique(). As noted above this won't work, because the function inherently relies on the array being sorted to be able to de-duplicate. Doing this will yield duplicate entries. I'm confused as to why my suggestion doesn't work, so I made it concrete. I sent an alternate commit 6/12 to your v2 series [1]. Thanks, -Stolee [1] https://public-inbox.org/git/20180501130318.58251-1-dsto...@microsoft.com/T/#u
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On Tue, May 01 2018, Derrick Stolee wrote: > On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote: >> On Tue, May 01 2018, Derrick Stolee wrote: >> >>> On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. >>> Here's a non-arbitrary reason: the object types are ordered >>> topologically (ignoring self-references): >>> >>> tag -> commit, tree, blob >>> commit -> tree >>> tree -> blob >> Thanks. I'll add a patch with that comment to v2. >> @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid, ds.fn = NULL; advise(_("The candidates are:")); - for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds); + for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect); + QSORT(collect.oid, collect.nr, sort_ambiguous); >>> I was wondering how the old code sorted by SHA even when the ambiguous >>> objects were loaded from different sources (multiple pack-files, loose >>> objects). Turns out that for_each_abbrev() does its own sort after >>> collecting the SHAs and then calls the given function pointer only >>> once per distinct object. This avoids multiple instances of the same >>> object, which may appear multiple times across pack-files. >>> >>> I only ask because now we are doing two sorts. I wonder if it would be >>> more elegant to provide your sorting algorithm to for_each_abbrev() >>> and let it call show_ambiguous_object as before. >>> >>> Another question is if we should use this sort generally for all calls >>> to for_each_abbrev(). The only other case I see is in >>> builtin/revparse.c. >> When preparing v2 I realized how confusing this was, so I'd added this >> to the commit message of my WIP re-roll which should explain this: >> >> A note on the implementation: I started out with something much >> simpler which just replaced oid_array_sort() in sha1-array.c with a >> custom sort function before calling oid_array_for_each_unique(). But >> then dumbly noticed that it doesn't work because the output function >> was tangled up with the code added in fad6b9e590 ("for_each_abbrev: >> drop duplicate objects", 2016-09-26) to ensure we don't display >> duplicate objects. >> That's why we're doing two passes here, first we need to >> sort the list >> and de-duplicate the objects, then sort them in our custom order, and >> finally output them without re-sorting them. I suppose we could also >> make oid_array_for_each_unique() maintain a hashmap of emitted >> objects, but that would increase its memory profile and wouldn't be >> worth the complexity for this one-off use-case, >> oid_array_for_each_unique() is used in many other places. > > How would sorting in our custom order before de-duplicating fail the > de-duplication? We will still pair identical OIDs as consecutive > elements and oid_array_for_each_unique only cares about consecutive > elements having distinct OIDs, not lex-ordered OIDs. Because there's no de-duplication without the array first being sorted in oidcmp() order, which oid_array_for_each_unique() checks for and re-sorts if !array->sorted. I.e. its de-duplication is just a state machine where it won't call the callback if the currently processed element has the same SHA1 as the last one. > Perhaps the noise is because we rely on oid_array_sort() to mark the > array as sorted inside oid_array_for_each_unique(), but that could be > remedied by calling our QSORT() inside for_each_abbrev() and marking > the array as sorted before calling oid_array_for_each_unique(). As noted above this won't work, because the function inherently relies on the array being sorted to be able to de-duplicate. Doing this will yield duplicate entries.
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, May 01 2018, Derrick Stolee wrote: On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. Here's a non-arbitrary reason: the object types are ordered topologically (ignoring self-references): tag -> commit, tree, blob commit -> tree tree -> blob Thanks. I'll add a patch with that comment to v2. @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid, ds.fn = NULL; advise(_("The candidates are:")); - for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds); + for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect); + QSORT(collect.oid, collect.nr, sort_ambiguous); I was wondering how the old code sorted by SHA even when the ambiguous objects were loaded from different sources (multiple pack-files, loose objects). Turns out that for_each_abbrev() does its own sort after collecting the SHAs and then calls the given function pointer only once per distinct object. This avoids multiple instances of the same object, which may appear multiple times across pack-files. I only ask because now we are doing two sorts. I wonder if it would be more elegant to provide your sorting algorithm to for_each_abbrev() and let it call show_ambiguous_object as before. Another question is if we should use this sort generally for all calls to for_each_abbrev(). The only other case I see is in builtin/revparse.c. When preparing v2 I realized how confusing this was, so I'd added this to the commit message of my WIP re-roll which should explain this: A note on the implementation: I started out with something much simpler which just replaced oid_array_sort() in sha1-array.c with a custom sort function before calling oid_array_for_each_unique(). But then dumbly noticed that it doesn't work because the output function was tangled up with the code added in fad6b9e590 ("for_each_abbrev: drop duplicate objects", 2016-09-26) to ensure we don't display duplicate objects. That's why we're doing two passes here, first we need to sort the list and de-duplicate the objects, then sort them in our custom order, and finally output them without re-sorting them. I suppose we could also make oid_array_for_each_unique() maintain a hashmap of emitted objects, but that would increase its memory profile and wouldn't be worth the complexity for this one-off use-case, oid_array_for_each_unique() is used in many other places. How would sorting in our custom order before de-duplicating fail the de-duplication? We will still pair identical OIDs as consecutive elements and oid_array_for_each_unique only cares about consecutive elements having distinct OIDs, not lex-ordered OIDs. Perhaps the noise is because we rely on oid_array_sort() to mark the array as sorted inside oid_array_for_each_unique(), but that could be remedied by calling our QSORT() inside for_each_abbrev() and marking the array as sorted before calling oid_array_for_each_unique(). (Again, my comments are not meant to block this series.) Thanks, -Stolee
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On Tue, May 01 2018, Derrick Stolee wrote: > On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: >> Since we show the commit data in the output that's nicely aligned once >> we sort by object type. The decision to show tags before commits is >> pretty arbitrary, but it's much less likely that we'll display a tag, >> so if there is one it makes sense to show it first. > > Here's a non-arbitrary reason: the object types are ordered > topologically (ignoring self-references): > > tag -> commit, tree, blob > commit -> tree > tree -> blob Thanks. I'll add a patch with that comment to v2. >> @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, >> struct object_id *oid, >> ds.fn = NULL; >> advise(_("The candidates are:")); >> -for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds); >> +for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect); >> +QSORT(collect.oid, collect.nr, sort_ambiguous); > > I was wondering how the old code sorted by SHA even when the ambiguous > objects were loaded from different sources (multiple pack-files, loose > objects). Turns out that for_each_abbrev() does its own sort after > collecting the SHAs and then calls the given function pointer only > once per distinct object. This avoids multiple instances of the same > object, which may appear multiple times across pack-files. > > I only ask because now we are doing two sorts. I wonder if it would be > more elegant to provide your sorting algorithm to for_each_abbrev() > and let it call show_ambiguous_object as before. > > Another question is if we should use this sort generally for all calls > to for_each_abbrev(). The only other case I see is in > builtin/revparse.c. When preparing v2 I realized how confusing this was, so I'd added this to the commit message of my WIP re-roll which should explain this: A note on the implementation: I started out with something much simpler which just replaced oid_array_sort() in sha1-array.c with a custom sort function before calling oid_array_for_each_unique(). But then dumbly noticed that it doesn't work because the output function was tangled up with the code added in fad6b9e590 ("for_each_abbrev: drop duplicate objects", 2016-09-26) to ensure we don't display duplicate objects. That's why we're doing two passes here, first we need to sort the list and de-duplicate the objects, then sort them in our custom order, and finally output them without re-sorting them. I suppose we could also make oid_array_for_each_unique() maintain a hashmap of emitted objects, but that would increase its memory profile and wouldn't be worth the complexity for this one-off use-case, oid_array_for_each_unique() is used in many other places.
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: Change the output emitted when an ambiguous object is encountered so that we show tags first, then commits, followed by trees, and finally blobs. Within each type we show objects in hashcmp(). Before this change the objects were only ordered by hashcmp(). The reason for doing this is that the output looks better as a result, e.g. the v2.17.0 tag before this change on "git show e8f2" would display: hint: The candidates are: hint: e8f2093055 tree hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f25a3a50 tree hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2650052 tag v2.17.0 hint: e8f2867228 blob hint: e8f28d537c tree hint: e8f2a35526 blob hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2cf6ec0 tree Now we'll instead show: hint: e8f2650052 tag v2.17.0 hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2093055 tree hint: e8f25a3a50 tree hint: e8f28d537c tree hint: e8f2cf6ec0 tree hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f2867228 blob hint: e8f2a35526 blob Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. Here's a non-arbitrary reason: the object types are ordered topologically (ignoring self-references): tag -> commit, tree, blob commit -> tree tree -> blob Signed-off-by: Ævar Arnfjörð Bjarmason --- sha1-array.c | 15 +++ sha1-array.h | 3 +++ sha1-name.c | 37 - 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/sha1-array.c b/sha1-array.c index 838b3bf847..48bd9e9230 100644 --- a/sha1-array.c +++ b/sha1-array.c @@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array) array->sorted = 0; } + +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data) +{ + int i; + + for (i = 0; i < array->nr; i++) { + int ret = fn(array->oid + i, data); + if (ret) + return ret; + } + return 0; +} + int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data) diff --git a/sha1-array.h b/sha1-array.h index 1e1d24b009..232bf95017 100644 --- a/sha1-array.h +++ b/sha1-array.h @@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array); typedef int (*for_each_oid_fn)(const struct object_id *oid, void *data); +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data); int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data); diff --git a/sha1-name.c b/sha1-name.c index 9d7bbd3e96..46d8b1afa6 100644 --- a/sha1-name.c +++ b/sha1-name.c @@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, void *data) return 0; } +static int sort_ambiguous(const void *a, const void *b) +{ + int a_type = oid_object_info(a, NULL); + int b_type = oid_object_info(b, NULL); + int a_type_sort; + int b_type_sort; + + /* +* Sorts by hash within the same object type, just as +* oid_array_for_each_unique() would do. +*/ + if (a_type == b_type) + return oidcmp(a, b); + + /* +* Between object types show tags, then commits, and finally +* trees and blobs. +* +* The object_type enum is commit, tree, blob, tag, but we +* want tag, commit, tree blob. Cleverly (perhaps too +* cleverly) do that with modulus, since the enum assigns 1 to +* commit, so tag becomes 0. +*/ I appreciate this comment. Clever things should be marked as such. + a_type_sort = a_type % 4; + b_type_sort = b_type % 4; + return a_type_sort > b_type_sort ? 1 : -1; +} + static int get_short_oid(const char *name, int len, struct object_id *oid, unsigned flags) { @@ -409,6 +437,8 @@ static int get_sho
[PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
Change the output emitted when an ambiguous object is encountered so that we show tags first, then commits, followed by trees, and finally blobs. Within each type we show objects in hashcmp(). Before this change the objects were only ordered by hashcmp(). The reason for doing this is that the output looks better as a result, e.g. the v2.17.0 tag before this change on "git show e8f2" would display: hint: The candidates are: hint: e8f2093055 tree hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f25a3a50 tree hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2650052 tag v2.17.0 hint: e8f2867228 blob hint: e8f28d537c tree hint: e8f2a35526 blob hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2cf6ec0 tree Now we'll instead show: hint: e8f2650052 tag v2.17.0 hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2093055 tree hint: e8f25a3a50 tree hint: e8f28d537c tree hint: e8f2cf6ec0 tree hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f2867228 blob hint: e8f2a35526 blob Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. Signed-off-by: Ævar Arnfjörð Bjarmason --- sha1-array.c | 15 +++ sha1-array.h | 3 +++ sha1-name.c | 37 - 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/sha1-array.c b/sha1-array.c index 838b3bf847..48bd9e9230 100644 --- a/sha1-array.c +++ b/sha1-array.c @@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array) array->sorted = 0; } + +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data) +{ + int i; + + for (i = 0; i < array->nr; i++) { + int ret = fn(array->oid + i, data); + if (ret) + return ret; + } + return 0; +} + int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data) diff --git a/sha1-array.h b/sha1-array.h index 1e1d24b009..232bf95017 100644 --- a/sha1-array.h +++ b/sha1-array.h @@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array); typedef int (*for_each_oid_fn)(const struct object_id *oid, void *data); +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data); int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data); diff --git a/sha1-name.c b/sha1-name.c index 9d7bbd3e96..46d8b1afa6 100644 --- a/sha1-name.c +++ b/sha1-name.c @@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, void *data) return 0; } +static int sort_ambiguous(const void *a, const void *b) +{ + int a_type = oid_object_info(a, NULL); + int b_type = oid_object_info(b, NULL); + int a_type_sort; + int b_type_sort; + + /* +* Sorts by hash within the same object type, just as +* oid_array_for_each_unique() would do. +*/ + if (a_type == b_type) + return oidcmp(a, b); + + /* +* Between object types show tags, then commits, and finally +* trees and blobs. +* +* The object_type enum is commit, tree, blob, tag, but we +* want tag, commit, tree blob. Cleverly (perhaps too +* cleverly) do that with modulus, since the enum assigns 1 to +* commit, so tag becomes 0. +*/ + a_type_sort = a_type % 4; + b_type_sort = b_type % 4; + return a_type_sort > b_type_sort ? 1 : -1; +} + static int get_short_oid(const char *name, int len, struct object_id *oid, unsigned flags) { @@ -409,6 +437,8 @@ static int get_short_oid(const char *name, int len, struct object_id *oid, status = finish_object_disambiguation(&ds, oid); if (!quietly && (status == SHORT_NAME_AMBIGUOUS)) { + struct oid_array collect = OID_ARRAY_INIT; + error(_("short SHA1 %s is ambiguous"), ds.hex_pfx); /* @@