Jeff King venit, vidit, dixit 05.02.2013 12:13:
> On Mon, Feb 04, 2013 at 04:27:31PM +0100, Michael J Gruber wrote:
>
>> Recently and not so recently, we made sure that log/grep type operations
>> use textconv filters when a userfacing diff would do the same:
>>
>> ef90ab6 (pickaxe: use textconv for -S counting, 2012-10-28)
>> b1c2f57 (diff_grep: use textconv buffers for add/deleted files, 2012-10-28)
>> 0508fe5 (combine-diff: respect textconv attributes, 2011-05-23)
>>
>> "git grep" currently does not use textconv filters at all, that is
>> neither for displaying the match and context nor for the actual grepping.
>>
>> Introduce a binary mode "--textconv" (in addition to "--text" and "-I")
>> which makes git grep use any configured textconv filters for grepping
>> and output purposes.
>
> Sounds like a reasonable goal.
>
>> The difficulty is in getting the different cases (blob/sha1 vs.
>> worktree) right, and in making the changes minimally invasive. It seems
>> that some more refactoring could help: "git show --textconv" does not
>> use textconv filters when used on blobs either. (It does for diffs, of
>> course.) Most existing helper functions are tailored for diffs.
>
> I think "git show" with blobs originally did not because we have no
> filename with which to look up the attributes. IIRC, the patches to
> support "cat-file --textconv" taught get_sha1_with_context to report the
> path at which we found a blob. I suspect it is mostly a matter of
> plumbing that information from the revision parser through to
> show_blob_object.
>
>> Nota bene: --textconv does not affect "diff --stat" either...
>
> Yeah, though I wonder if it should be on by default. The diffstat for a
> binary file, unlike the diff, is already useful. The diffstat of the
> textconv'd data may _also_ be useful, of course.
>
>> @@ -659,6 +659,9 @@ int cmd_grep(int argc, const char **argv, const char
>> *prefix)
>> OPT_SET_INT('I', NULL, &opt.binary,
>> N_("don't match patterns in binary files"),
>> GREP_BINARY_NOMATCH),
>> + OPT_SET_INT(0, "textconv", &opt.binary,
>> + N_("process binary files with textconv filters"),
>> + GREP_BINARY_TEXTCONV),
>
> Is this really a new form of GREP_BINARY_*? What happens when a file
> does not have a textconv filter?
>
> I would expect this to be more like diff's "--textconv" flag, which is
> really "allow textconv to be respected". Then you could do:
>
> git grep -I --textconv foo
>
> to grep in the text version of files which support it, and ignore the
> rest.
>
>> -static int grep_source_load(struct grep_source *gs);
>> -static int grep_source_is_binary(struct grep_source *gs);
>> +static int grep_source_load(struct grep_source *gs, struct grep_opt *opt);
>> +static int grep_source_is_binary(struct grep_source *gs, struct grep_opt
>> *opt);
>
> Hmm. grep_source_load is more or less the analogue of
> diff_populate_filespec, which does not know about textconv at all. So I
> feel like this might be going in at the wrong layer...
>
>> @@ -1354,14 +1356,15 @@ static int grep_source_1(struct grep_opt *opt,
>> struct grep_source *gs, int colle
>>
>> switch (opt->binary) {
>> case GREP_BINARY_DEFAULT:
>> - if (grep_source_is_binary(gs))
>> + if (grep_source_is_binary(gs, opt))
>> binary_match_only = 1;
>> break;
>> case GREP_BINARY_NOMATCH:
>> - if (grep_source_is_binary(gs))
>> + if (grep_source_is_binary(gs, opt))
>> return 0; /* Assume unmatch */
>> break;
>
> The is_binary function learned about "opt" so that it could pass it to
> grep_source_load, which might do the textconv for us. But we _know_ that
> we will not here, because we see that we have other GREP_BINARY flags.
>
> And when we do have _TEXTCONV:
>
>> case GREP_BINARY_TEXT:
>> + case GREP_BINARY_TEXTCONV:
>> break;
>
> We do not call is_binary. :)
>
>> -static int grep_source_load_sha1(struct grep_source *gs)
>> +static int grep_source_load_sha1(struct grep_source *gs, struct grep_opt
>> *opt)
>> {
>> enum object_type type;
>> -
>> grep_read_lock();
>> - gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
>> + if (opt->binary == GREP_BINARY_TEXTCONV) {
>> + struct diff_filespec *df = alloc_filespec(gs->name);
>> + gs->size = fill_textconv(gs->driver, df, &gs->buf);
>> + free_filespec(df);
>> + } else {
>> + gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
>> + }
>
> So here we do the textconv for the sha1 case. But what about file
> sources?
>
> This is why I think the layer is wrong; you want the fill_textconv
> function to call your abstract _load function (in the diff_filespec
> world, it is diff_filespec_populate, but it is the moral equivalent).
> And you want it to hold off as long as possible in case we can pull the
> value from cache, or feed the working tree version of a file straight to
> the filter.
>
>> -void grep_source_load_driver(struct grep_source *gs)
>> +void grep_source_load_driver(struct grep_source *gs, struct grep_opt *opt)
>> {
>> if (gs->driver)
>> return;
>>
>> - grep_attr_lock();
>> + grep_attr_lock(); //TODO
>> + printf("Looking up userdiff driver for: %s", gs->path);
>> if (gs->path)
>> gs->driver = userdiff_find_by_path(gs->path);
>> if (!gs->driver)
>> gs->driver = userdiff_find_by_name("default");
>> + if (opt->binary == GREP_BINARY_TEXTCONV)
>> + gs->driver = userdiff_get_textconv(gs->driver);
>> grep_attr_unlock();
>> }
>
> This is wrong. The point of userdiff_get_textconv is that it will return
> NULL when we are not doing textconv for this path. So you can use it
> like:
>
> struct userdiff_driver *textconv = userdiff_get_textconv(gs->driver);
>
> if (textconv) {
> /* ok, we are doing textconv. Call our fill_textconv
> * equivalent. */
> }
> else {
> /* nope, plain old file. */
> }
>
> But by assigning it on top of gs->driver, you're going to end up with a
> NULL driver sometimes. And the post-condition of the load_driver
> function is that gs->driver always points to a valid driver (even if it
> is the default one). I wouldn't be surprised if this causes segfaults.
>
>
> So I would do it more like the patch below. Only lightly tested by me.
> There are some refactoring opportunities if you want to bring
> grep_source and diff_filespec closer together.
>
> ---
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 8025964..915c8ef 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -659,6 +659,8 @@ int cmd_grep(int argc, const char **argv, const char
> *prefix)
> OPT_SET_INT('I', NULL, &opt.binary,
> N_("don't match patterns in binary files"),
> GREP_BINARY_NOMATCH),
> + OPT_BOOL(0, "textconv", &opt.allow_textconv,
> + N_("process binary files with textconv filters")),
> { OPTION_INTEGER, 0, "max-depth", &opt.max_depth, N_("depth"),
> N_("descend at most <depth> levels"), PARSE_OPT_NONEG,
> NULL, 1 },
> diff --git a/grep.c b/grep.c
> index 4bd1b8b..3880d64 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -2,6 +2,8 @@
> #include "grep.h"
> #include "userdiff.h"
> #include "xdiff-interface.h"
> +#include "diff.h"
> +#include "diffcore.h"
>
> static int grep_source_load(struct grep_source *gs);
> static int grep_source_is_binary(struct grep_source *gs);
> @@ -1321,6 +1323,58 @@ static void std_output(struct grep_opt *opt, const
> void *buf, size_t size)
> fwrite(buf, size, 1, stdout);
> }
>
> +static int fill_textconv_grep(struct userdiff_driver *driver,
> + struct grep_source *gs)
> +{
> + struct diff_filespec *df;
> + char *buf;
> + size_t size;
> +
> + if (!driver || !driver->textconv)
> + return grep_source_load(gs);
> +
> + /*
> + * The textconv interface is intimately tied to diff_filespecs, so we
> + * have to pretend to be one. If we could unify the grep_source
> + * and diff_filespec structs, this mess could just go away.
> + */
> + df = alloc_filespec(gs->path);
> + switch (gs->type) {
> + case GREP_SOURCE_SHA1:
> + fill_filespec(df, gs->identifier, 1, 0100644);
> + break;
> + case GREP_SOURCE_FILE:
> + fill_filespec(df, null_sha1, 0, 0100644);
> + break;
> + default:
> + die("BUG: attempt to textconv something without a path?");
> + }
> +
> + /*
> + * fill_textconv is not remotely thread-safe; it may load objects
> + * behind the scenes, and it modifies the global diff tempfile
> + * structure.
> + */
> + grep_read_lock();
> + size = fill_textconv(driver, df, &buf);
> + grep_read_unlock();
> + free_filespec(df);
> +
> + /*
> + * The normal fill_textconv usage by the diff machinery would just keep
> + * the textconv'd buf separate from the diff_filespec. But much of the
> + * grep code passes around a grep_source and assumes that its "buf"
> + * pointer is the beginning of the thing we are searching. So let's
> + * install our textconv'd version into the grep_source, taking care not
> + * to leak any existing buffer.
> + */
> + grep_source_clear_data(gs);
> + gs->buf = buf;
> + gs->size = size;
> +
> + return 0;
> +}
> +
> static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int
> collect_hits)
> {
> char *bol;
> @@ -1331,6 +1385,7 @@ static int grep_source_1(struct grep_opt *opt, struct
> grep_source *gs, int colle
> unsigned count = 0;
> int try_lookahead = 0;
> int show_function = 0;
> + struct userdiff_driver *textconv = NULL;
> enum grep_context ctx = GREP_CONTEXT_HEAD;
> xdemitconf_t xecfg;
>
> @@ -1352,19 +1407,36 @@ static int grep_source_1(struct grep_opt *opt, struct
> grep_source *gs, int colle
> }
> opt->last_shown = 0;
>
> - switch (opt->binary) {
> - case GREP_BINARY_DEFAULT:
> - if (grep_source_is_binary(gs))
> - binary_match_only = 1;
> - break;
> - case GREP_BINARY_NOMATCH:
> - if (grep_source_is_binary(gs))
> - return 0; /* Assume unmatch */
> - break;
> - case GREP_BINARY_TEXT:
> - break;
> - default:
> - die("bug: unknown binary handling mode");
> + if (opt->allow_textconv) {
> + grep_source_load_driver(gs);
> + /*
> + * We might set up the shared textconv cache data here, which
> + * is not thread-safe.
> + */
> + grep_attr_lock();
> + textconv = userdiff_get_textconv(gs->driver);
> + grep_attr_unlock();
> + }
> +
> + /*
> + * We know the result of a textconv is text, so we only have to care
> + * about binary handling if we are not using it.
> + */
> + if (!textconv) {
> + switch (opt->binary) {
> + case GREP_BINARY_DEFAULT:
> + if (grep_source_is_binary(gs))
> + binary_match_only = 1;
> + break;
> + case GREP_BINARY_NOMATCH:
> + if (grep_source_is_binary(gs))
> + return 0; /* Assume unmatch */
> + break;
> + case GREP_BINARY_TEXT:
> + break;
> + default:
> + die("bug: unknown binary handling mode");
> + }
> }
>
> memset(&xecfg, 0, sizeof(xecfg));
> @@ -1372,7 +1444,7 @@ static int grep_source_1(struct grep_opt *opt, struct
> grep_source *gs, int colle
>
> try_lookahead = should_lookahead(opt);
>
> - if (grep_source_load(gs) < 0)
> + if (fill_textconv_grep(textconv, gs) < 0)
> return 0;
>
> bol = gs->buf;
> diff --git a/grep.h b/grep.h
> index 8fc854f..94a7ac2 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -106,6 +106,7 @@ struct grep_opt {
> #define GREP_BINARY_NOMATCH 1
> #define GREP_BINARY_TEXT 2
> int binary;
> + int allow_textconv;
> int extended;
> int use_reflog_filter;
> int pcre;
>
Thanks Jeff, that helps a lot! It covers "grep expr" and "grep expr rev
-- path" just fine. I'll look into "grep expr rev:path" which does not
work yet because of an empty driver.
I also have "show --textconv" covered and a suggestion for "cat-file
--textconv" (to work without a textconv filter).
Expect a mini-series soon :)
Michael
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html