Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
Jeff King p...@peff.net writes: To cat-file we could add an option like --sha1-only or --literal or --no-dwim (... better names are failing me) which would skip *all* dwimming of 40-character strings. It would also assume that any shorter strings are abbreviated SHA-1s and fail if they are not. This would be a nice feature by itself (these are object names, dammit, and don't try to tell me differently!) and would have the additional small advantage of speeding up lookups of abbreviated SHA-1s, which (regardless of your patch) otherwise go through the whole DWIM process. I can see in theory that somebody might want that, but I am having a hard time thinking of a practical use. Would it be a good alternative to call get_sha1_hex() to catch the most common case (reading from rev-list output, for example) and then let the more general get_sha1() to let extended SHA-1 to be handled? IOW, it seems like a poor default, and we are choosing it only because of backwards compatibility. I guess another option is to switch the default with the usual deprecation dance. I agree that did you mean the unreadable refname or 40-hex object? turned on everywhere get_sha1() is called is a very poor default. I wonder if we can limit it only to the end-user input somehow at the API level. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
On Sun, Jul 14, 2013 at 08:45:37PM -0700, Junio C Hamano wrote: To cat-file we could add an option like --sha1-only or --literal or --no-dwim (... better names are failing me) which would skip *all* dwimming of 40-character strings. It would also assume that any shorter strings are abbreviated SHA-1s and fail if they are not. This would be a nice feature by itself (these are object names, dammit, and don't try to tell me differently!) and would have the additional small advantage of speeding up lookups of abbreviated SHA-1s, which (regardless of your patch) otherwise go through the whole DWIM process. I can see in theory that somebody might want that, but I am having a hard time thinking of a practical use. Would it be a good alternative to call get_sha1_hex() to catch the most common case (reading from rev-list output, for example) and then let the more general get_sha1() to let extended SHA-1 to be handled? For a 40-byte sha1, that _should_ be what get_sha1 does (i.e., go more or less directly to the 40-hex code path, and return). And that's basically what happens now, except that after we do so, we now have the extra oh, is it also a refname? check. For a shortened sha1, I don't think it would have the same behavior. Right now, I believe the order is to treat a short sha1 as a possible refname, and only if that fails consider it as a short sha1. IOW, it seems like a poor default, and we are choosing it only because of backwards compatibility. I guess another option is to switch the default with the usual deprecation dance. I agree that did you mean the unreadable refname or 40-hex object? turned on everywhere get_sha1() is called is a very poor default. I wonder if we can limit it only to the end-user input somehow at the API level. It is easy to do on top of my patch (just flip the default on the switch I introduced, and turn it back on in whichever code paths are appropriate). But the question is: what is end-user input? Do you mean command-line arguments to rev-list and friends? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
On Fri, Jul 12, 2013 at 12:30:07PM +0200, Michael Haggerty wrote: But with particular respect to git cat-file, I see problems: 1. get_ref_snapshot() would have to read all loose and packed refs within the specified subtree, because loose refs have to be read before packed refs. So the call would be expensive if there are a lot of loose refs. And DWIM wouldn't know in advance where the references might be, so it would have to set prefix=. If many refs are looked up, then it would presumably be worth it. But if only a couple of lookups are done and there are a lot of loose refs, then using a cache would probably slow things down. The slowdown could be ameliorated by adding some more intelligence, for example only populating the loose refs cache after a certain number of lookups have already been done. I had assumed we would load the loose ref cache progressively by directory. Of course that only helps avoiding refs/foo/* when you are interested in refs/heads/foo. If your refs/heads/* is very large, you have to load all of it, and at some point it is cheaper to do a few stat() calls than to actually readdir() the whole directory. On the other hand, that is basically how packed-refs work now (if we hit any ref, we have to load the whole file), and repos with many refs would typically pack them all anyway. 2. A git cat-file --batch process can be long-lived. What guarantees would users expect regarding its lookup results? I hadn't really intended this for general lookups, but just to speed up the refname warning, where being out-of-date is more acceptable (since the warning is a purely informational hint). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
A common use of cat-file --batch-check is to feed a list of objects from rev-list --objects or a similar command. In this instance, all of our input objects are 40-byte sha1 ids. However, cat-file has always allowed arbitrary revision specifiers, and feeds the result to get_sha1(). Fortunately, get_sha1() recognizes a 40-byte sha1 before doing any hard work trying to look up refs, meaning this scenario should end up spending very little time converting the input into an object sha1. However, since 798c35f (get_sha1: warn about full or short object names that look like refs, 2013-05-29), when we encounter this case, we spend the extra effort to do a refname lookup anyway, just to print a warning. This is further exacerbated by ca91993 (get_packed_ref_cache: reload packed-refs file when it changes, 2013-06-20), which makes individual ref lookup more expensive by requiring a stat() of the packed-refs file for each missing ref. With no patches, this is the time it takes to run: $ git rev-list --objects --all objects $ time git cat-file --batch-check='%(objectname)' objects on the linux.git repository: real1m13.494s user0m25.924s sys 0m47.532s If we revert ca91993, the packed-refs up-to-date check, it gets a little better: real0m54.697s user0m21.692s sys 0m32.916s but we are still spending quite a bit of time on ref lookup (and we would not want to revert that patch, anyway, which has correctness issues). If we revert 798c35f, disabling the warning entirely, we get a much more reasonable time: real0m7.452s user0m6.836s sys 0m0.608s This patch does the moral equivalent of this final case (and gets similar speedups). We introduce a global flag that callers of get_sha1() can use to avoid paying the price for the warning. Signed-off-by: Jeff King p...@peff.net --- The solution feels a little hacky, but I'm not sure there is a better one short of reverting the warning entirely. We could also tie it to warn_ambiguous_refs (or add another config variable), but I don't think that makes sense. It is not about do I care about ambiguities in this repository, but rather I am going to do a really large number of sha1 resolutions, and I do not want to pay the price in this particular code path for the warning). There may be other sites that resolve a large number of refs and run into this, but I couldn't think of any. Doing for_each_ref would not have the same problem, as we already know they are refs there. builtin/cat-file.c | 9 + cache.h| 1 + environment.c | 1 + sha1_name.c| 14 -- 4 files changed, 19 insertions(+), 6 deletions(-) diff --git a/builtin/cat-file.c b/builtin/cat-file.c index 0e64b41..fe5c77f 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -266,6 +266,15 @@ static int batch_objects(struct batch_options *opt) strbuf_expand(buf, opt-format, expand_format, data); data.mark_query = 0; + /* +* We are going to call get_sha1 on a potentially very large number of +* objects. In most large cases, these will be actual object sha1s. The +* cost to double-check that each one is not also a ref (just so we can +* warn) ends up dwarfing the actual cost of the object lookups +* themselves. We can work around it by just turning off the warning. +*/ + warn_on_object_refname_ambiguity = 0; + while (strbuf_getline(buf, stdin, '\n') != EOF) { char *p; int error; diff --git a/cache.h b/cache.h index 2d06169..c1fd82c 100644 --- a/cache.h +++ b/cache.h @@ -575,6 +575,7 @@ extern int warn_ambiguous_refs; extern int prefer_symlink_refs; extern int log_all_ref_updates; extern int warn_ambiguous_refs; +extern int warn_on_object_refname_ambiguity; extern int shared_repository; extern const char *apply_default_whitespace; extern const char *apply_default_ignorewhitespace; diff --git a/environment.c b/environment.c index 0cb67b2..5398c36 100644 --- a/environment.c +++ b/environment.c @@ -22,6 +22,7 @@ int warn_ambiguous_refs = 1; int is_bare_repository_cfg = -1; /* unspecified */ int log_all_ref_updates = -1; /* unspecified */ int warn_ambiguous_refs = 1; +int warn_on_object_refname_ambiguity = 1; int repository_format_version; const char *git_commit_encoding; const char *git_log_output_encoding; diff --git a/sha1_name.c b/sha1_name.c index 90419ef..6f8812a 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -452,13 +452,15 @@ static int get_sha1_basic(const char *str, int len, unsigned char *sha1) int at, reflog_len, nth_prior = 0; if (len == 40 !get_sha1_hex(str, sha1)) { - refs_found = dwim_ref(str, len, tmp_sha1, real_ref); - if (refs_found 0 warn_ambiguous_refs) { - warning(warn_msg, len, str); - if (advice_object_name_warning) - fprintf(stderr, %s\n,
Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
On 07/12/2013 08:20 AM, Jeff King wrote: A common use of cat-file --batch-check is to feed a list of objects from rev-list --objects or a similar command. In this instance, all of our input objects are 40-byte sha1 ids. However, cat-file has always allowed arbitrary revision specifiers, and feeds the result to get_sha1(). Fortunately, get_sha1() recognizes a 40-byte sha1 before doing any hard work trying to look up refs, meaning this scenario should end up spending very little time converting the input into an object sha1. However, since 798c35f (get_sha1: warn about full or short object names that look like refs, 2013-05-29), when we encounter this case, we spend the extra effort to do a refname lookup anyway, just to print a warning. This is further exacerbated by ca91993 (get_packed_ref_cache: reload packed-refs file when it changes, 2013-06-20), which makes individual ref lookup more expensive by requiring a stat() of the packed-refs file for each missing ref. With no patches, this is the time it takes to run: $ git rev-list --objects --all objects $ time git cat-file --batch-check='%(objectname)' objects on the linux.git repository: real1m13.494s user0m25.924s sys 0m47.532s If we revert ca91993, the packed-refs up-to-date check, it gets a little better: real0m54.697s user0m21.692s sys 0m32.916s but we are still spending quite a bit of time on ref lookup (and we would not want to revert that patch, anyway, which has correctness issues). If we revert 798c35f, disabling the warning entirely, we get a much more reasonable time: real0m7.452s user0m6.836s sys 0m0.608s This patch does the moral equivalent of this final case (and gets similar speedups). We introduce a global flag that callers of get_sha1() can use to avoid paying the price for the warning. Signed-off-by: Jeff King p...@peff.net --- The solution feels a little hacky, but I'm not sure there is a better one short of reverting the warning entirely. We could also tie it to warn_ambiguous_refs (or add another config variable), but I don't think that makes sense. It is not about do I care about ambiguities in this repository, but rather I am going to do a really large number of sha1 resolutions, and I do not want to pay the price in this particular code path for the warning). There may be other sites that resolve a large number of refs and run into this, but I couldn't think of any. Doing for_each_ref would not have the same problem, as we already know they are refs there. ISTM that callers of --batch-check would usually know whether they are feeding it SHA-1s or arbitrary object specifiers. (And in most cases when there are a large number of them, they are probably all SHA-1s). So instead of framing this change as skip object refname ambiguity check (i.e., sacrifice some correctness for performance), perhaps it would be less hacky to offer the following new feature: To cat-file we could add an option like --sha1-only or --literal or --no-dwim (... better names are failing me) which would skip *all* dwimming of 40-character strings. It would also assume that any shorter strings are abbreviated SHA-1s and fail if they are not. This would be a nice feature by itself (these are object names, dammit, and don't try to tell me differently!) and would have the additional small advantage of speeding up lookups of abbreviated SHA-1s, which (regardless of your patch) otherwise go through the whole DWIM process. I understand that such an option alone would leave some old scripts suffering the performance regression caused by the check, but in most cases it would be easy to fix them. Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
On Fri, Jul 12, 2013 at 10:47:46AM +0200, Michael Haggerty wrote: The solution feels a little hacky, but I'm not sure there is a better one short of reverting the warning entirely. We could also tie it to warn_ambiguous_refs (or add another config variable), but I don't think that makes sense. It is not about do I care about ambiguities in this repository, but rather I am going to do a really large number of sha1 resolutions, and I do not want to pay the price in this particular code path for the warning). There may be other sites that resolve a large number of refs and run into this, but I couldn't think of any. Doing for_each_ref would not have the same problem, as we already know they are refs there. ISTM that callers of --batch-check would usually know whether they are feeding it SHA-1s or arbitrary object specifiers. (And in most cases when there are a large number of them, they are probably all SHA-1s). Thanks for bringing this up; I had meant to outline this option in the cover letter, but forgot when posting. If were designing cat-file from scratch, I'd consider having --batch take just 40-hex sha1s in the first place. Since we can't do that now for backwards compatibility, having an option is the best we can do there. But having to use such an option to avoid a 10x slowdown just strikes me as ugly for two reasons: 1. It's part of the user-visible interface, and it seems like an extra wtf? moment when somebody wonders why their script is painfully slow, and why they need a magic option to fix it. We're cluttering the interface forever. 2. It's a sign that the refname check was put in for a specific performance profile (resolving one or a few refs), but didn't consider resolving a large number. I'm wondering what other performance regressions it's possible to trigger. So instead of framing this change as skip object refname ambiguity check (i.e., sacrifice some correctness for performance), perhaps it would be less hacky to offer the following new feature: I wouldn't frame it as correctness for performance at all. The check does not impact behavior at all, and is purely informational (to catch quite a rare situation, too). I'd argue that the check simply isn't that interesting in this case in the first place. To cat-file we could add an option like --sha1-only or --literal or --no-dwim (... better names are failing me) which would skip *all* dwimming of 40-character strings. It would also assume that any shorter strings are abbreviated SHA-1s and fail if they are not. This would be a nice feature by itself (these are object names, dammit, and don't try to tell me differently!) and would have the additional small advantage of speeding up lookups of abbreviated SHA-1s, which (regardless of your patch) otherwise go through the whole DWIM process. I can see in theory that somebody might want that, but I am having a hard time thinking of a practical use. I understand that such an option alone would leave some old scripts suffering the performance regression caused by the check, but in most cases it would be easy to fix them. I'm less worried about old scripts, and more worried about carrying around a confusing option forever, especially when I expect most invocations to use it (perhaps my experience is biased, but I use cat-file --batch quite a lot in scripting, and it has almost always been on the output of rev-list). IOW, it seems like a poor default, and we are choosing it only because of backwards compatibility. I guess another option is to switch the default with the usual deprecation dance. Yet another option is to consider what the check is doing, and accomplish the same thing in a different way. The real pain is that we are individually trying to resolve each object by hitting the filesystem (and doing lots of silly checks on the refname format, when we know it must be valid). We don't actually care in this case if the ref list is up to date (we are not trying to update or read a ref, but only know if it exists, and raciness is OK). IOW, could we replace the dwim_ref call for the warning with something that directly queries the ref cache? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
On 07/12/2013 11:22 AM, Jeff King wrote: Yet another option is to consider what the check is doing, and accomplish the same thing in a different way. The real pain is that we are individually trying to resolve each object by hitting the filesystem (and doing lots of silly checks on the refname format, when we know it must be valid). We don't actually care in this case if the ref list is up to date (we are not trying to update or read a ref, but only know if it exists, and raciness is OK). IOW, could we replace the dwim_ref call for the warning with something that directly queries the ref cache? I think it would be quite practical to add an API something like struct ref_snapshot *get_ref_snapshot(const char *prefix) void release_ref_snapshot(struct ref_snapshot *) int lookup_ref(struct ref_snapshot *, const char *refname, unsigned char *sha1, int *flags) where prefix is the part of the refs tree that you want included in the snapshot (e.g., refs/heads) and ref_snapshot is probably opaque outside of the refs module. Symbolic refs, which are currently not stored in the ref_cache, would have to be added because otherwise we would have to do all of the lookups anyway. I think this would be a good step to take for many reasons, including because it would be another useful step in the direction of ref transactions. But with particular respect to git cat-file, I see problems: 1. get_ref_snapshot() would have to read all loose and packed refs within the specified subtree, because loose refs have to be read before packed refs. So the call would be expensive if there are a lot of loose refs. And DWIM wouldn't know in advance where the references might be, so it would have to set prefix=. If many refs are looked up, then it would presumably be worth it. But if only a couple of lookups are done and there are a lot of loose refs, then using a cache would probably slow things down. The slowdown could be ameliorated by adding some more intelligence, for example only populating the loose refs cache after a certain number of lookups have already been done. 2. A git cat-file --batch process can be long-lived. What guarantees would users expect regarding its lookup results? Currently, its ref lookups reflect the state of the repo at the moment the commit identifier is written into the pipe. Using a cache like this would mean that ref lookups would always reflect the snapshot taken at the start of the git cat-file run, regardless of whether the script using it might have added or modified some references since then. I think this would have to be considered a regression. Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html