Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

Thomas Braun Fri, 27 Jun 2014 11:57:13 -0700

Am 26.06.2014 19:55, schrieb Junio C Hamano:
> Nguyễn Thái Ngọc Duy  <[email protected]> writes:
> 
>> Too large files may lead to failure to allocate memory. If it happens
>> here, it could impact quite a few commands that involve
>> diff. Moreover, too large files are inefficient to compare anyway (and
>> most likely non-text), so mark them binary and skip looking at their
>> content.
>> ...
>> diff --git a/diff.c b/diff.c
>> index a489540..7a977aa 100644
>> --- a/diff.c
>> +++ b/diff.c
>> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
>>                      one->is_binary = one->driver->binary;
>>              else {
>>                      if (!one->data && DIFF_FILE_VALID(one))
>> -                            diff_populate_filespec(one, 0);
>> -                    if (one->data)
>> +                            diff_populate_filespec(one, 
>> DIFF_POPULATE_IS_BINARY);
>> +                    if (one->is_binary == -1 && one->data)
>>                              one->is_binary = buffer_is_binary(one->data,
>>                                              one->size);
>>                      if (one->is_binary == -1)
> 
> The name is misleading and forced me to read it twice before I
> realized that this is "populating the is-binary bit".  It might make
> it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
> perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
> the other bit may want to be also renamed from SIZE_ONLY to either
> 
>  (1) CHECK_SIZE_ONLY
> 
>  (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
>      make SIZE_ONLY the union of two
> 
> I do not have strong preference either way; the latter may be more
> logical in that "not loading contents" and "check size" are sort of
> orthogonal in that you can later choose to check, without loading
> contents, only the binary-ness without checking size, but no calles
> that passes a non-zero flag to the populate-filespec function will
> want to slurp in the contents in practice, so in that sense we could
> declare that the NO_CONENTS bit is implied.
> 
> But more importantly, would this patch actually help?  For one
> thing, this wouldn't (and shouldn't) help if the user wants --binary
> diff out of us anyway, I suspect, but I wonder what the following
> codepath in the builtin_diff() function would do:
> 
>               ...
>       } else if (!DIFF_OPT_TST(o, TEXT) &&
>           ( (!textconv_one && diff_filespec_is_binary(one)) ||
>             (!textconv_two && diff_filespec_is_binary(two)) )) {
>               if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
>                       die("unable to read files to diff");
>               /* Quite common confusing case */
>               if (mf1.size == mf2.size &&
>                   !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
>                       if (must_show_header)
>                               fprintf(o->file, "%s", header.buf);
>                       goto free_ab_and_return;
>               }
>               fprintf(o->file, "%s", header.buf);
>               strbuf_reset(&header);
>               if (DIFF_OPT_TST(o, BINARY))
>                       emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
>               else
>                       fprintf(o->file, "%sBinary files %s and %s differ\n",
>                               line_prefix, lbl[0], lbl[1]);
>               o->found_changes = 1;
>       } else {
>               ...
> 
> If we weren't told with --text/-a to force textual output, and
> at least one of the sides is marked as binary (and this patch marks
> a large blob as binary unless attributes says otherwise), we still
> call fill_mmfile() on them to slurp the contents to be compared, no?
> 
> And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
> check if the sides are identical, so...


Good point. So how about an additional change roughly sketched as

@@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct
diff_filespec *one)
        return userdiff_get_textconv(one->driver);
 }

+/* read the files in small chunks into memory and compare them */
+static int filecmp_chunked(struct diff_filespec *one,
+       struct diff_filespec *two)
+{
+       // TODO add implementation
+       return 0;
+}
+
 static void builtin_diff(const char *name_a,
                         const char *name_b,
                         struct diff_filespec *one,
@@ -2325,19 +2333,26 @@ static void builtin_diff(const char *name_a,
        } else if (!DIFF_OPT_TST(o, TEXT) &&
            ( (!textconv_one && diff_filespec_is_binary(one)) ||
              (!textconv_two && diff_filespec_is_binary(two)) )) {
-               if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2,two)< 0)
-                       die("unable to read files to diff");
+
+               unsigned long size1 = diff_filespec_size(one);
+               unsigned long size2 = diff_filespec_size(two);
+
+               if (size1 == 0 || size2 == 0)
+                       die("unable to retrieve file sizes for diff");
                /* Quite common confusing case */
-               if (mf1.size == mf2.size &&
-                   !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
+               if (size1 == size2 && !filecmp_chunked(one,two)) {
                        if (must_show_header)
                                fprintf(o->file, "%s", header.buf);
                        goto free_ab_and_return;
                }
                fprintf(o->file, "%s", header.buf);
                strbuf_reset(&header);
-               if (DIFF_OPT_TST(o, BINARY))
+               if (DIFF_OPT_TST(o, BINARY)) {
+                       if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, 
two) < 0)
+                               die("unable to read files to diff");
                        emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
+               }
                else
                        fprintf(o->file, "%sBinary files %s and %s differ\n",
                                line_prefix, lbl[0], lbl[1]);

Then the default diff case, no BINARY flag, would not read both files into 
memory.
filecmp_chunked will be slower than file_mmfile and memcmp, but its whole 
purpose is to read and compare the files in chunks.
The chunk size can be something like 64MiB.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

Reply via email to