Am 26.06.2014 19:55, schrieb Junio C Hamano:
> Nguyễn Thái Ngọc Duy <[email protected]> writes:
>
>> Too large files may lead to failure to allocate memory. If it happens
>> here, it could impact quite a few commands that involve
>> diff. Moreover, too large files are inefficient to compare anyway (and
>> most likely non-text), so mark them binary and skip looking at their
>> content.
>> ...
>> diff --git a/diff.c b/diff.c
>> index a489540..7a977aa 100644
>> --- a/diff.c
>> +++ b/diff.c
>> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
>> one->is_binary = one->driver->binary;
>> else {
>> if (!one->data && DIFF_FILE_VALID(one))
>> - diff_populate_filespec(one, 0);
>> - if (one->data)
>> + diff_populate_filespec(one,
>> DIFF_POPULATE_IS_BINARY);
>> + if (one->is_binary == -1 && one->data)
>> one->is_binary = buffer_is_binary(one->data,
>> one->size);
>> if (one->is_binary == -1)
>
> The name is misleading and forced me to read it twice before I
> realized that this is "populating the is-binary bit". It might make
> it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
> perhaps DIFF_POPULATE_CHECK_BINARY or something. For consistency,
> the other bit may want to be also renamed from SIZE_ONLY to either
>
> (1) CHECK_SIZE_ONLY
>
> (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
> make SIZE_ONLY the union of two
>
> I do not have strong preference either way; the latter may be more
> logical in that "not loading contents" and "check size" are sort of
> orthogonal in that you can later choose to check, without loading
> contents, only the binary-ness without checking size, but no calles
> that passes a non-zero flag to the populate-filespec function will
> want to slurp in the contents in practice, so in that sense we could
> declare that the NO_CONENTS bit is implied.
>
> But more importantly, would this patch actually help? For one
> thing, this wouldn't (and shouldn't) help if the user wants --binary
> diff out of us anyway, I suspect, but I wonder what the following
> codepath in the builtin_diff() function would do:
>
> ...
> } else if (!DIFF_OPT_TST(o, TEXT) &&
> ( (!textconv_one && diff_filespec_is_binary(one)) ||
> (!textconv_two && diff_filespec_is_binary(two)) )) {
> if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
> die("unable to read files to diff");
> /* Quite common confusing case */
> if (mf1.size == mf2.size &&
> !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
> if (must_show_header)
> fprintf(o->file, "%s", header.buf);
> goto free_ab_and_return;
> }
> fprintf(o->file, "%s", header.buf);
> strbuf_reset(&header);
> if (DIFF_OPT_TST(o, BINARY))
> emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
> else
> fprintf(o->file, "%sBinary files %s and %s differ\n",
> line_prefix, lbl[0], lbl[1]);
> o->found_changes = 1;
> } else {
> ...
>
> If we weren't told with --text/-a to force textual output, and
> at least one of the sides is marked as binary (and this patch marks
> a large blob as binary unless attributes says otherwise), we still
> call fill_mmfile() on them to slurp the contents to be compared, no?
>
> And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
> check if the sides are identical, so...
Good point. So how about an additional change roughly sketched as
@@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct
diff_filespec *one)
return userdiff_get_textconv(one->driver);
}
+/* read the files in small chunks into memory and compare them */
+static int filecmp_chunked(struct diff_filespec *one,
+ struct diff_filespec *two)
+{
+ // TODO add implementation
+ return 0;
+}
+
static void builtin_diff(const char *name_a,
const char *name_b,
struct diff_filespec *one,
@@ -2325,19 +2333,26 @@ static void builtin_diff(const char *name_a,
} else if (!DIFF_OPT_TST(o, TEXT) &&
( (!textconv_one && diff_filespec_is_binary(one)) ||
(!textconv_two && diff_filespec_is_binary(two)) )) {
- if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2,two)< 0)
- die("unable to read files to diff");
+
+ unsigned long size1 = diff_filespec_size(one);
+ unsigned long size2 = diff_filespec_size(two);
+
+ if (size1 == 0 || size2 == 0)
+ die("unable to retrieve file sizes for diff");
/* Quite common confusing case */
- if (mf1.size == mf2.size &&
- !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
+ if (size1 == size2 && !filecmp_chunked(one,two)) {
if (must_show_header)
fprintf(o->file, "%s", header.buf);
goto free_ab_and_return;
}
fprintf(o->file, "%s", header.buf);
strbuf_reset(&header);
- if (DIFF_OPT_TST(o, BINARY))
+ if (DIFF_OPT_TST(o, BINARY)) {
+ if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2,
two) < 0)
+ die("unable to read files to diff");
emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
+ }
else
fprintf(o->file, "%sBinary files %s and %s differ\n",
line_prefix, lbl[0], lbl[1]);
Then the default diff case, no BINARY flag, would not read both files into
memory.
filecmp_chunked will be slower than file_mmfile and memcmp, but its whole
purpose is to read and compare the files in chunks.
The chunk size can be something like 64MiB.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html