Re: [PATCH v1 1/2] entry.c: update cache entry only for existing files

2017-10-09 Thread Jeff King
On Sun, Oct 08, 2017 at 11:37:14PM +0200, Lars Schneider wrote:

> >> Yeah, I think that makes more sense.
> >> 
> >> A patch may look like this on top of these two patches, but I'd
> >> prefer to see Lars's eyeballing and possibly wrapping it up in an
> >> applicable patch after taking the authorship.
> > 
> 
> This looks all good to me. Thank you!
> A few minor style suggestions below.

Thanks, these were all reasonable (I actually avoided unwrapping some of
the lines in the original to make the "-w" diff more readable :) ).

I ended up breaking this into three commits, since I think that makes it
easier to see what each of the changes is doing. Here's what I have (on
top of what Junio has already queued in ls/filter-process-delayed):

  [v2 1/3]: write_entry: fix leak when retrying delayed filter
  [v2 2/3]: write_entry: avoid reading blobs in CE_RETRY case
  [v2 3/3]: write_entry: untangle symlink and regular-file cases

 entry.c | 83 ++---
 1 file changed, 48 insertions(+), 35 deletions(-)

-Peff


Re: [PATCH v1 1/2] entry.c: update cache entry only for existing files

2017-10-08 Thread Lars Schneider

> On 06 Oct 2017, at 06:54, Jeff King wrote:
> 
> On Fri, Oct 06, 2017 at 08:01:48AM +0900, Junio C Hamano wrote:
> 
>>> But
>>> I think we'd want to protect the read_blob_entry() call at the top of
>>> the case with a check for dco->state == CE_RETRY.
>> 
>> Yeah, I think that makes more sense.
>> 
>> A patch may look like this on top of these two patches, but I'd
>> prefer to see Lars's eyeballing and possibly wrapping it up in an
>> applicable patch after taking the authorship.
> 

This looks all good to me. Thank you!
A few minor style suggestions below.


> ...
> 
> The "structured" way, of course, would be to put everything under
> write_out_file into a helper function and just call it from both places
> rather than relying on a spaghetti of gotos and switch-breaks.
> 
> I'm OK with whatever structure we end up with, as long as it fixes the
> leak (and ideally the pessimization).
> 
> Anyway, here's the real patch in case anybody wants to apply it and play
> with it further.
> 
> -- >8 --
> diff --git a/entry.c b/entry.c
> index 1c7e3c11d5..d28b42d82d 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -261,6 +261,7 @@ static int write_entry(struct cache_entry *ce,
>   size_t newsize = 0;
>   struct stat st;
>   const struct submodule *sub;
> + struct delayed_checkout *dco = state->delayed_checkout;
> 
>   if (ce_mode_s_ifmt == S_IFREG) {
>   struct stream_filter *filter = get_stream_filter(ce->name,
> @@ -273,55 +274,61 @@ static int write_entry(struct cache_entry *ce,
>   }
> 
>   switch (ce_mode_s_ifmt) {
> - case S_IFREG:
>   case S_IFLNK:
>   new = read_blob_entry(ce, &size);
>   if (!new)
>   return error("unable to read sha1 file of %s (%s)",
>   path, oid_to_hex(&ce->oid));
> 
> - if (ce_mode_s_ifmt == S_IFLNK && has_symlinks && !to_tempfile) {
> - ret = symlink(new, path);
> - free(new);
> - if (ret)
> - return error_errno("unable to create symlink %s",
> -path);

Nit: This could go into one line now.


> - break;
> - }
> + /* fallback to handling it like a regular file if we must */
> + if (!has_symlinks || to_tempfile)
> + goto write_out_file;
> 
> + ret = symlink(new, path);
> + free(new);
> + if (ret)
> + return error_errno("unable to create symlink %s",
> +path);
> + break;
> +
> + case S_IFREG:
>   /*
>* Convert from git internal format to working tree format
>*/
> - if (ce_mode_s_ifmt == S_IFREG) {
> - struct delayed_checkout *dco = state->delayed_checkout;
> - if (dco && dco->state != CE_NO_DELAY) {
> - /* Do not send the blob in case of a retry. */
> - if (dco->state == CE_RETRY) {

Maybe we could add something like this here:
/* The filter process already got the blob in case of a retry.
   No need to send it again! */

> - new = NULL;
> - size = 0;
> - }
> - ret = async_convert_to_working_tree(
> - ce->name, new, size, &buf, dco);

Nit: This could go into one line now.


> - if (ret && string_list_has_string(&dco->paths, ce->name)) {
> - free(new);
> - goto finish;
> - }
> - } else
> - ret = convert_to_working_tree(
> - ce->name, new, size, &buf);

Nit: This could go into one line now.


> 
> - if (ret) {
> + if (dco && dco->state == CE_RETRY) {
> + new = NULL;
> + size = 0;
> + } else {
> + new = read_blob_entry(ce, &size);
> + if (!new)
> + return error ("unable to read sha1 file of %s (%s)",
> +   path, oid_to_hex(&ce->oid));
> + }
> +
> + if (dco && dco->state != CE_NO_DELAY) {
> + ret = async_convert_to_working_tree(
> + ce->name, new, size, &buf, dco);
> + if (ret && string_list_has_string(&dco->paths, ce->name)) {
>   free(new);
> - new = strbuf_detach(&buf, &newsize);
> - size = newsize;
> + goto finish;
> 

Re: [PATCH v1 1/2] entry.c: update cache entry only for existing files

2017-10-05 Thread Jeff King
On Fri, Oct 06, 2017 at 08:01:48AM +0900, Junio C Hamano wrote:

> > But
> > I think we'd want to protect the read_blob_entry() call at the top of
> > the case with a check for dco->state == CE_RETRY.
> 
> Yeah, I think that makes more sense.
> 
> A patch may look like this on top of these two patches, but I'd
> prefer to see Lars's eyeballing and possibly wrapping it up in an
> applicable patch after taking the authorship.

Yeah, agreed.

> I considered initializing new to NULL and size to 0 but decided
> against it, as that would lose the justification to have an if
> statement that marks that "dco->state == CE_RETRY" is a special
> case.  I think explicit if() with clearing these two variables makes
> it clearer to show what is going on.

Also agreed.

> By the way, the S_IFLNK handling seems iffy with or without this
> change (or for that matter, I suspect this iffy-ness existed before
> Lars's delayed filtering change).  On a platform without symlinks,
> we do the same as S_IFREG, but obviously we do not want any content
> conversion that happens to the regular files in such a case.  So we
> may further want to fix that, but I left it outside the scope of
> fixing the leak of NULL and optimizing the blob reading out.

I think the current code is correct because the conversion all happens
in the S_IFREG if-block. We just fall-through down to the actual write
phase for the symlink case.

That said, I found the fall-through here confusing as hell. I actually
think it would be a lot clearer with a goto, which is saying something.
Here's the "diff -w" of what I mean, for readability (the real patch is
at the bottom for reference, but it adjusts the indentation quite a
bit).

diff --git a/entry.c b/entry.c
index 1c7e3c11d5..d28b42d82d 100644
--- a/entry.c
+++ b/entry.c
@@ -261,6 +261,7 @@ static int write_entry(struct cache_entry *ce,
size_t newsize = 0;
struct stat st;
const struct submodule *sub;
+   struct delayed_checkout *dco = state->delayed_checkout;
 
if (ce_mode_s_ifmt == S_IFREG) {
struct stream_filter *filter = get_stream_filter(ce->name,
@@ -273,33 +274,39 @@ static int write_entry(struct cache_entry *ce,
}
 
switch (ce_mode_s_ifmt) {
-   case S_IFREG:
case S_IFLNK:
new = read_blob_entry(ce, &size);
if (!new)
return error("unable to read sha1 file of %s (%s)",
path, oid_to_hex(&ce->oid));
 
-   if (ce_mode_s_ifmt == S_IFLNK && has_symlinks && !to_tempfile) {
+   /* fallback to handling it like a regular file if we must */
+   if (!has_symlinks || to_tempfile)
+   goto write_out_file;
+
ret = symlink(new, path);
free(new);
if (ret)
return error_errno("unable to create symlink %s",
   path);
break;
-   }
 
+   case S_IFREG:
/*
 * Convert from git internal format to working tree format
 */
-   if (ce_mode_s_ifmt == S_IFREG) {
-   struct delayed_checkout *dco = state->delayed_checkout;
-   if (dco && dco->state != CE_NO_DELAY) {
-   /* Do not send the blob in case of a retry. */
-   if (dco->state == CE_RETRY) {
+
+   if (dco && dco->state == CE_RETRY) {
new = NULL;
size = 0;
+   } else {
+   new = read_blob_entry(ce, &size);
+   if (!new)
+   return error ("unable to read sha1 file of %s (%s)",
+ path, oid_to_hex(&ce->oid));
}
+
+   if (dco && dco->state != CE_NO_DELAY) {
ret = async_convert_to_working_tree(
ce->name, new, size, &buf, dco);
if (ret && string_list_has_string(&dco->paths, ce->name)) {
@@ -320,8 +327,8 @@ static int write_entry(struct cache_entry *ce,
 * point. If the error would have been fatal (e.g.
 * filter is required), then we would have died already.
 */
-   }
 
+write_out_file:
fd = open_output_fd(path, ce, to_tempfile);
if (fd < 0) {
free(new);

The "structured" way, of course, would be to put everything under
write_out_file into a helper function and just call it from both places
rather than relying on a spaghetti of gotos and switch-breaks.
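
To make that shape concrete, here is a minimal sketch of the helper-function
structure. It is not a patch against entry.c: every name in it
(write_entry_sketch, write_out_file, the entry_kind enum) is hypothetical,
and it writes into a caller-supplied buffer instead of the filesystem so
that it stays self-contained. The only point is that the symlink fallback
and the regular-file case both end in the same helper, and only the
regular-file case goes through conversion.

```c
#include <stdio.h>

/* Hypothetical entry kinds; Git switches on ce_mode_s_ifmt instead. */
enum entry_kind { ENTRY_SYMLINK, ENTRY_REGULAR };

/* Shared "write the bytes out" step, here writing into a caller buffer. */
static int write_out_file(char *dst, size_t dstlen, const char *data)
{
	int n = snprintf(dst, dstlen, "%s", data);
	return (n < 0 || (size_t)n >= dstlen) ? -1 : 0;
}

static int write_entry_sketch(enum entry_kind kind, int has_symlinks,
			      const char *blob, char *dst, size_t dstlen)
{
	char converted[128];
	int n;

	switch (kind) {
	case ENTRY_SYMLINK:
		if (has_symlinks) {
			/* Stand-in for symlink(new, path). */
			n = snprintf(dst, dstlen, "-> %s", blob);
			return (n < 0 || (size_t)n >= dstlen) ? -1 : 0;
		}
		/* No symlink support: write the target as-is, unconverted. */
		return write_out_file(dst, dstlen, blob);
	case ENTRY_REGULAR:
		/* Stand-in for convert_to_working_tree(). */
		n = snprintf(converted, sizeof(converted), "[filtered]%s", blob);
		if (n < 0 || (size_t)n >= sizeof(converted))
			return -1;
		return write_out_file(dst, dstlen, converted);
	}
	return -1;
}
```

With this layout neither a goto nor a fall-through is needed; each case
returns on its own.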

I'm OK with whatever structure we end up with, as long as it fixes the
leak (and ideally the pessimization).

Anyway, here's the real patch in case anybody wants to apply it and play
with it further.

-- >8 --
diff 

Re: [PATCH v1 1/2] entry.c: update cache entry only for existing files

2017-10-05 Thread Junio C Hamano
Jeff King writes:

> On Thu, Oct 05, 2017 at 08:19:13PM +0900, Junio C Hamano wrote:
>
>> This is unrelated to the main topic of this patch, but we see this
>> just before the precontext of this hunk:
>> 
>>  if (dco && dco->state != CE_NO_DELAY) {
>>  /* Do not send the blob in case of a retry. */
>>  if (dco->state == CE_RETRY) {
>>  new = NULL;
>>  size = 0;
>>  }
>>  ret = async_convert_to_working_tree(
>>  ce->name, new, size, &buf, dco);
>> 
>> Aren't we leaking "new" in that CE_RETRY case?
>
> Yes, it certainly looks like it. Wouldn't we want to avoid reading the
> file from disk entirely in that case?

Probably.  But that is more of a removal of pessimization than a fix ;-)

> I.e., I think free(new) is sufficient to fix the leak you
> mentioned.

In addition to keeping the new = NULL assignment, of course.

> But
> I think we'd want to protect the read_blob_entry() call at the top of
> the case with a check for dco->state == CE_RETRY.

Yeah, I think that makes more sense.

A patch may look like this on top of these two patches, but I'd
prefer to see Lars's eyeballing and possibly wrapping it up in an
applicable patch after taking the authorship.

I considered initializing new to NULL and size to 0 but decided
against it, as that would lose the justification to have an if
statement that marks that "dco->state == CE_RETRY" is a special
case.  I think explicit if() with clearing these two variables makes
it clearer to show what is going on.
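
As a standalone illustration of that explicit if(), here is a minimal
sketch. The enum, the read counter and fake_read_blob() are hypothetical
stand-ins (the latter for read_blob_entry()), not the code in entry.c. It
shows that in the retry case nothing is allocated, so there is nothing to
leak, and the object store is not touched at all.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the delayed-checkout states. */
enum delay_state_sketch { CE_NO_DELAY, CE_CAN_DELAY, CE_RETRY };

static int blob_reads; /* counts how often the fake object store is read */

/* Hypothetical stand-in for read_blob_entry(): returns a heap copy. */
static char *fake_read_blob(const char *content, unsigned long *size)
{
	size_t len = strlen(content);
	char *buf = malloc(len + 1);

	if (!buf)
		return NULL;
	memcpy(buf, content, len + 1);
	*size = (unsigned long)len;
	blob_reads++;
	return buf;
}

/*
 * Load the blob only when it will actually be sent to the filter.
 * Returns 0 on success, -1 if the read failed.
 */
static int load_blob_for_filter(enum delay_state_sketch state,
				const char *content,
				char **out, unsigned long *size)
{
	if (state == CE_RETRY) {
		/* The filter already has the blob; nothing is allocated. */
		*out = NULL;
		*size = 0;
		return 0;
	}
	*out = fake_read_blob(content, size);
	return *out ? 0 : -1;
}
```

The caller still owns (and must free) the buffer in the non-retry case, as
before.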

By the way, the S_IFLNK handling seems iffy with or without this
change (or for that matter, I suspect this iffy-ness existed before
Lars's delayed filtering change).  On a platform without symlinks,
we do the same as S_IFREG, but obviously we do not want any content
conversion that happens to the regular files in such a case.  So we
may further want to fix that, but I left it outside the scope of
fixing the leak of NULL and optimizing the blob reading out.


 entry.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/entry.c b/entry.c
index cac5bf5af2..74e35f942c 100644
--- a/entry.c
+++ b/entry.c
@@ -274,14 +274,12 @@ static int write_entry(struct cache_entry *ce,
}
 
switch (ce_mode_s_ifmt) {
-   case S_IFREG:
case S_IFLNK:
new = read_blob_entry(ce, &size);
if (!new)
return error("unable to read sha1 file of %s (%s)",
path, oid_to_hex(&ce->oid));
-
-   if (ce_mode_s_ifmt == S_IFLNK && has_symlinks && !to_tempfile) {
+   if (has_symlinks && !to_tempfile) {
ret = symlink(new, path);
free(new);
if (ret)
@@ -289,18 +287,28 @@ static int write_entry(struct cache_entry *ce,
   path);
break;
}
-
+   /* fallthru */
+   case S_IFREG:
/*
 * Convert from git internal format to working tree format
 */
if (ce_mode_s_ifmt == S_IFREG) {
struct delayed_checkout *dco = state->delayed_checkout;
+
+   /* 
+* In case of a retry, we do not send blob, hence no
+* need to read it, either.
+*/
+   if (dco && dco->state == CE_RETRY) {
+   new = NULL;
+   size = 0;
+   } else {
+   new = read_blob_entry(ce, &size);
+   if (!new)
+   return error("unable to read sha1 file of %s (%s)",
+path, oid_to_hex(&ce->oid));
+   }
if (dco && dco->state != CE_NO_DELAY) {
-   /* Do not send the blob in case of a retry. */
-   if (dco->state == CE_RETRY) {
-   new = NULL;
-   size = 0;
-   }
ret = async_convert_to_working_tree(
ce->name, new, size, &buf, dco);
if (ret && string_list_has_string(&dco->paths, ce->name)) {


Re: [PATCH v1 1/2] entry.c: update cache entry only for existing files

2017-10-05 Thread Jeff King
On Thu, Oct 05, 2017 at 08:19:13PM +0900, Junio C Hamano wrote:

> > diff --git a/entry.c b/entry.c
> > index 1c7e3c11d5..5dab656364 100644
> > --- a/entry.c
> > +++ b/entry.c
> > @@ -304,7 +304,7 @@ static int write_entry(struct cache_entry *ce,
> > ce->name, new, size, &buf, dco);
> > if (ret && string_list_has_string(&dco->paths, ce->name)) {
> > free(new);
> > -   goto finish;
> > +   goto delayed;
> > }
> > } else
> > ret = convert_to_working_tree(
> 
> This is unrelated to the main topic of this patch, but we see this
> just before the precontext of this hunk:
> 
>   if (dco && dco->state != CE_NO_DELAY) {
>   /* Do not send the blob in case of a retry. */
>   if (dco->state == CE_RETRY) {
>   new = NULL;
>   size = 0;
>   }
>   ret = async_convert_to_working_tree(
>   ce->name, new, size, &buf, dco);
> 
> Aren't we leaking "new" in that CE_RETRY case?

Yes, it certainly looks like it. Wouldn't we want to avoid reading the
file from disk entirely in that case?

I.e., I think free(new) is sufficient to fix the leak you mentioned. But
I think we'd want to protect the read_blob_entry() call at the top of
the case with a check for dco->state == CE_RETRY.

-Peff


Re: [PATCH v1 1/2] entry.c: update cache entry only for existing files

2017-10-05 Thread Junio C Hamano
lars.schnei...@autodesk.com writes:

> diff --git a/entry.c b/entry.c
> index 1c7e3c11d5..5dab656364 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -304,7 +304,7 @@ static int write_entry(struct cache_entry *ce,
>   ce->name, new, size, &buf, dco);
>   if (ret && string_list_has_string(&dco->paths, ce->name)) {
>   free(new);
> - goto finish;
> + goto delayed;
>   }
>   } else
>   ret = convert_to_working_tree(

This is unrelated to the main topic of this patch, but we see this
just before the precontext of this hunk:

if (dco && dco->state != CE_NO_DELAY) {
/* Do not send the blob in case of a retry. */
if (dco->state == CE_RETRY) {
new = NULL;
size = 0;
}
ret = async_convert_to_working_tree(
ce->name, new, size, &buf, dco);

Aren't we leaking "new" in that CE_RETRY case?


Re: [PATCH v1 1/2] entry.c: update cache entry only for existing files

2017-10-05 Thread Jeff King
On Thu, Oct 05, 2017 at 12:44:06PM +0200, lars.schnei...@autodesk.com wrote:

> From: Lars Schneider 
> 
> In 2841e8f ("convert: add "status=delayed" to filter process protocol",
> 2017-06-30) we taught the filter process protocol to delay responses.
> 
> That means an external filter might answer in the first write_entry()
> call on a file that requires filtering: "I got your request, but I
> can't answer right now. Ask again later!". As Git got no answer, we do
> not write anything to the filesystem. Consequently, the lstat() call in
> the finish block of the function writes garbage to the cache entry.
> The garbage is eventually overwritten when the filter answers with
> the final file content in a subsequent write_entry() call.
> 
> Fix the brief time window of garbage in the cache entry by adding a
> special finish block that does nothing for delayed responses. The cache
> entry is written properly in a subsequent write_entry() call where
> the filter responds with the final file content.

Nicely explained and the patch looks correct. I also verified that MSan
is happy with t0021 after this.
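
For anyone skimming the thread, the control flow being fixed can be
sketched roughly as below. This is an illustration only: the struct and
function names are hypothetical, and the assignment to st stands in for the
lstat() call in the real finish block. The point is that a delayed reply
jumps past the block that copies stat data into the cache entry, so the
entry is never filled from an uninitialized struct.

```c
/* Hypothetical stand-ins for struct stat and struct cache_entry. */
struct fake_stat { long size; };
struct fake_entry { long cached_size; int refreshed; };

enum filter_reply { FILTER_DONE, FILTER_DELAYED };

static int checkout_sketch(struct fake_entry *ce, enum filter_reply reply,
			   long content_size)
{
	struct fake_stat st;

	if (reply == FILTER_DELAYED)
		goto delayed; /* nothing was written, so there is nothing to stat */

	/* The file was written; this stands in for lstat() filling st. */
	st.size = content_size;

	/* The "finish" step: copy stat data into the cache entry. */
	ce->cached_size = st.size;
	ce->refreshed = 1;

delayed:
	return 0;
}
```

Before the fix, the delayed case went through the finish step as well,
which is where the uninitialized stat data came from.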

Thanks for the quick turnaround.

-Peff