Re: [PATCH v6 6/6] convert: add "status=delayed" to filter process protocol

2017-06-27 Thread Lars Schneider

> On 27 Jun 2017, at 00:13, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>> ...
> 
>>> I am wondering whose responsibility it will be to deal with a path
>>> "list-available" reports that are *not* asked by Git or Git got no
>>> "delayed" response.  The list subtraction done by the caller is
>>> probably the logical place to do so.
>> 
>> Correct. Git (the caller) will notice that not all "delayed" paths
>> are listed by the filter and throw an error at the end.
> 
> I am wondering about the other case.  Git didn't ask for a path to
> be filtered at all, but the filter sneaked in a path that happens to
> in the index in its response---Git should at least ignore it, and
> better yet, diagnose it as an error in the filter process.

Agreed. I've used your implementation suggestion [1] and added two
test cases to ensure we signal a proper error in case of a buggy filter.

Thanks,
Lars


[1] http://public-inbox.org/git/xmqqzicu2x4a@gitster.mtv.corp.google.com/


Re: [PATCH v6 6/6] convert: add "status=delayed" to filter process protocol

2017-06-26 Thread Junio C Hamano
Junio C Hamano  writes:

 +  filter->string = "";
 +  continue;
 +  }
 +
 +  /* In dco->paths we store a list of all delayed paths.
 + The filter just send us a list of available paths.
 + Remove them from the list.
 +  */
 +  filter_string_list(>paths, 0,
 +  _available_paths, _paths);
>>> 
>>> We first remove from the outstanding request list (dco->paths) what
>>> are now ready...
>>> 
 +  for_each_string_list_item(path, _paths) {
>>> 
>>> ...and go over those paths that are now ready.
>>> 
 +  struct cache_entry* ce = index_file_exists(
 +  state->istate, path->string,
 +  strlen(path->string), 0);
 +  assert(dco->state == CE_RETRY);
 +  errs |= (ce ? checkout_entry(ce, state, NULL) : 
 1);
 +  }
>>> 
>>> But we never checked if the contents of this available_paths list is
>>> a subset of the set of paths we originally asked the external
>>> process to filter.
>>
>> Correct.
>>
>>>  This will allow the process to overwrite any
>>> random path that is not even involved in the checkout.
>>
>> No, not "any random path". Only paths that are part of the checkout.
>
> Isn't it "any path that index_file_exists() returns a CE for".  Did
> you filter out elements in available_paths that weren't part of
> dco->paths?  I thought the filter-string-list you have are for the
> other way around (which is necessary to keep track of work to be
> done, but that filtering does not help rejecting rogue responses at
> all).

That is, something along the lines of this on top of the series.
When going over the list of delayed paths to see if any of them is
answered, in remove_available_paths(), we mark the util field of the
answered one.  Later when we go over the list of answered one, we
make sure that it was actually matched with something on dco->paths
before calling checkout_entry().

 entry.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/entry.c b/entry.c
index c6d5cc01dc..f2af67e015 100644
--- a/entry.c
+++ b/entry.c
@@ -150,7 +150,12 @@ void enable_delayed_checkout(struct checkout *state)
 static int remove_available_paths(struct string_list_item *item, void *cb_data)
 {
struct string_list *available_paths = cb_data;
-   return !string_list_has_string(available_paths, item->string);
+   struct string_list_item *available;
+
+   available = string_list_lookup(available_paths, item->string);
+   if (available)
+   avaiable->util = (void *)item->string;
+   return !available;
 }
 
 int finish_delayed_checkout(struct checkout *state)
@@ -192,9 +197,16 @@ int finish_delayed_checkout(struct checkout *state)
_available_paths, _paths);
 
for_each_string_list_item(path, _paths) {
-   struct cache_entry* ce = index_file_exists(
-   state->istate, path->string,
-   strlen(path->string), 0);
+   struct cache_entry* ce;
+
+   if (!path->util) {
+   error("filter gave '%s' which was 
unasked for",
+ path->string);
+   errs |= 1;
+   continue;
+   }
+   ce = index_file_exists(state->istate, 
path->string,
+  strlen(path->string), 0);
assert(dco->state == CE_RETRY);
errs |= (ce ? checkout_entry(ce, state, NULL) : 
1);
}


Re: [PATCH v6 6/6] convert: add "status=delayed" to filter process protocol

2017-06-26 Thread Junio C Hamano
Lars Schneider  writes:

> Maybe this?
> [...] If Git sends this command, then the
> filter is expected to return a list of pathnames representing blobs 
> that have been delayed earlier and are now available. [...]

OK.

>>> +by a "success" status that is also terminated with a flush packet. If
>>> +no blobs for the delayed paths are available, yet, then the filter is
>>> +expected to block the response until at least one blob becomes
>>> +available.
>> 
>> Ahh, this is better, at least you use "the delayed paths".
>> 
>> What if the result never gets available (e.g. resulted in an error)?
>
> As soon as the filter responds with an empty list, Git stops asking.
> All blobs that Git has not received at this point are considered an
> error.
>
> Should I mention that explicitly?

Otherwise I wouldn't have wondered "what if".

>> I am wondering whose responsibility it will be to deal with a path
>> "list-available" reports that are *not* asked by Git or Git got no
>> "delayed" response.  The list subtraction done by the caller is
>> probably the logical place to do so.
>
> Correct. Git (the caller) will notice that not all "delayed" paths
> are listed by the filter and throw an error at the end.

I am wondering about the other case.  Git didn't ask for a path to
be filtered at all, but the filter sneaked in a path that happens to
in the index in its response---Git should at least ignore it, and
better yet, diagnose it as an error in the filter process.

>>> +   filter->string = "";
>>> +   continue;
>>> +   }
>>> +
>>> +   /* In dco->paths we store a list of all delayed paths.
>>> +  The filter just send us a list of available paths.
>>> +  Remove them from the list.
>>> +   */
>>> +   filter_string_list(>paths, 0,
>>> +   _available_paths, _paths);
>> 
>> We first remove from the outstanding request list (dco->paths) what
>> are now ready...
>> 
>>> +   for_each_string_list_item(path, _paths) {
>> 
>> ...and go over those paths that are now ready.
>> 
>>> +   struct cache_entry* ce = index_file_exists(
>>> +   state->istate, path->string,
>>> +   strlen(path->string), 0);
>>> +   assert(dco->state == CE_RETRY);
>>> +   errs |= (ce ? checkout_entry(ce, state, NULL) : 
>>> 1);
>>> +   }
>> 
>> But we never checked if the contents of this available_paths list is
>> a subset of the set of paths we originally asked the external
>> process to filter.
>
> Correct.
>
>>  This will allow the process to overwrite any
>> random path that is not even involved in the checkout.
>
> No, not "any random path". Only paths that are part of the checkout.

Isn't it "any path that index_file_exists() returns a CE for".  Did
you filter out elements in available_paths that weren't part of
dco->paths?  I thought the filter-string-list you have are for the
other way around (which is necessary to keep track of work to be
done, but that filtering does not help rejecting rogue responses at
all).

> There are three cases:
>
> (1) available_path is a path that was delayed before (= happy case!)
> (2) available_path is a path that was not delayed before, 
> but filtered (= no problem, as filtering is a idempotent operation)
> (3) available_path is a path that was neither delayed nor filtered
> before (= if the filter returns the blob as-is then this would
> be no problem. otherwise we would indeed have a screwed checkout)
>
> Case 3 might introduce a problem if the filter is buggy.  

> Would you be OK with this check to catch case 3?

I'd be very suspicious about anything you would do only with .nr
field, without filtering the other way around.  After all, we may
have asked it for 3 paths to be filtered, and it may have answered
with its own 3 different paths.

> dco_path_count = dco->paths.nr;
> filter_string_list(>paths, 0,
> _available_paths, _paths);
>
> if (dco_path_count - dco->paths.nr != available_paths.nr) {
> /* The filter responded with entries that have not
>  * been delay earlier. Do not ask the filter
>  * for available blobs, again, as the filter is
>  * likely buggy. This will generate an error at 
>  * the end as some files are not filtered properly.
>  */
> filter->string = "";
> error(_("The external filter '%s' responded with "
> "available blobs which have not been delayed "
> "earlier."), filter->string);
> continue;
> }
>
>
>>> +   }
>>> +   string_list_remove_empty_items(>filters, 0);
>>> +   }
>>> +   string_list_clear(>filters, 0);
>>> +
>>> +   /* At this point we should not have any 

Re: [PATCH v6 6/6] convert: add "status=delayed" to filter process protocol

2017-06-26 Thread Lars Schneider

> On 26 Jun 2017, at 21:02, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob (e.g. the Git LFS smudge filter might
>> perform network requests). During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.
> 
> Good motivation.  I'd say s/might/may/ to stress the fact that this
> is addressing a real-world problem, i.e. some filters incur network
> latencies.

Agreed.


>> Teach the filter process protocol (introduced in edcc858) to accept the
> 
> When referring to an old commit, we recommend this format
> 
>edcc8581 ("convert: add filter..process option", 2016-10-16)
> 
> (with or without dq around the title) which helps readers by telling
> them how old the thing is and what it was about.

Agreed.


> ...
> 
>> +Delay
>> +^
>> +
>> +If the filter supports the "delay" capability, then Git can send the
>> +flag "can-delay" after the filter command and pathname. This flag
>> +denotes that the filter can delay filtering the current blob (e.g. to
>> +compensate network latencies) by responding with no content but with
>> +the status "delayed" and a flush packet.
>> +
>> +packet:  git> command=smudge
>> +packet:  git> pathname=path/testfile.dat
>> +packet:  git> can-delay=1
>> +packet:  git> 
>> +packet:  git> CONTENT
>> +packet:  git> 
>> +packet:  git< status=delayed
>> +packet:  git< 
>> +
>> +
>> +If the filter supports the "delay" capability then it must support the
>> +"list_available_blobs" command. If Git sends this command, then the
>> +filter is expected to return a list of pathnames of blobs that are
>> +available. The list must be terminated with a flush packet followed
> 
> Is it correct to read the above "pathnames of blobs that are
> availble" as "pathnames, among which Git earlier requested to be
> filtered with "can-delay=1", for which the filtered result is
> ready"?  I do not mean to suggest this awkward wording to replace
> yours, but I found yours a bit too fuzzy for first time readers.
> 
> Perhaps my brain has rotten by hearing the "delayed/lazy transfer of
> large blobs", but it went "Huh?" upon seeing "...are available".

Maybe this?
[...] If Git sends this command, then the
filter is expected to return a list of pathnames representing blobs 
that have been delayed earlier and are now available. [...]


>> +by a "success" status that is also terminated with a flush packet. If
>> +no blobs for the delayed paths are available, yet, then the filter is
>> +expected to block the response until at least one blob becomes
>> +available.
> 
> Ahh, this is better, at least you use "the delayed paths".
> 
> What if the result never gets available (e.g. resulted in an error)?

As soon as the filter responds with an empty list, Git stops asking.
All blobs that Git has not received at this point are considered an
error.

Should I mention that explicitly?


> Or is it considered "we _now_ know the result is an error" and such
> a delayed path that failed to retrieve from LFS store "available" at
> that point?

No. "list_available_blobs" only returns blobs that are immediately
available. Errors are not available.


> It may be too late to raise to a series that went to v6, but this
> smells more like "ready" not "available" to me.

You mean you would call it "list_ready_blobs"? I am no native speaker
but I feel "available" sounds better. I also contemplated "retrievable".

I think I understand your reasoning, though. In the case of GitLFS all these 
blobs are "available". Only the ones that GitLFS has downloaded are 
ready, though. However, other filters might delay blobs for non-network
related reasons and then "available" and "ready" would be the same, no?

I would like to keep "available".

> ...

>> diff --git a/cache.h b/cache.h
>> index ae4c45d379..2ec12c4477 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -1551,8 +1552,11 @@ struct checkout {
>> };
>> #define CHECKOUT_INIT { NULL, "" }
>> 
>> +
>> #define TEMPORARY_FILENAME_LENGTH 25
> 
> You probably do not need that new blank line.

Agreed! I have no idea why/how that got in.


> ...
>> diff --git a/convert.c b/convert.c
>> index e55c034d86..6452ab546a 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -533,7 +534,8 @@ static int start_multi_file_filter_fn(struct 
>> subprocess_entry *subprocess)
>>  if (err)
>>  goto done;
>> 
>> -err = packet_writel(process->in, "capability=clean", 
>> "capability=smudge", NULL);
>> +err = packet_writel(process->in,
>> +"capability=clean", "capability=smudge", "capability=delay", 
>> NULL);
>> 
>>  for (;;) {
>>  cap_buf = packet_read_line(process->out, NULL);
>> @@ -549,6 +551,8 

Re: [PATCH v6 6/6] convert: add "status=delayed" to filter process protocol

2017-06-26 Thread Junio C Hamano
Lars Schneider  writes:

> Some `clean` / `smudge` filters might require a significant amount of
> time to process a single blob (e.g. the Git LFS smudge filter might
> perform network requests). During this process the Git checkout
> operation is blocked and Git needs to wait until the filter is done to
> continue with the checkout.

Good motivation.  I'd say s/might/may/ to stress the fact that this
is addressing a real-world problem, i.e. some filters incur network
latencies.

> Teach the filter process protocol (introduced in edcc858) to accept the

When referring to an old commit, we recommend this format

edcc8581 ("convert: add filter..process option", 2016-10-16)

(with or without dq around the title) which helps readers by telling
them how old the thing is and what it was about.

> @@ -512,12 +512,69 @@ the protocol then Git will stop the filter process and 
> restart it
>  with the next file that needs to be processed. Depending on the
>  `filter..required` flag Git will interpret that as error.
>  
> -After the filter has processed a blob it is expected to wait for
> -the next "key=value" list containing a command. Git will close
> +After the filter has processed a command it is expected to wait for
> +a "key=value" list containing the next command. Git will close

Good generalization.

> +Delay
> +^
> +
> +If the filter supports the "delay" capability, then Git can send the
> +flag "can-delay" after the filter command and pathname. This flag
> +denotes that the filter can delay filtering the current blob (e.g. to
> +compensate network latencies) by responding with no content but with
> +the status "delayed" and a flush packet.
> +
> +packet:  git> command=smudge
> +packet:  git> pathname=path/testfile.dat
> +packet:  git> can-delay=1
> +packet:  git> 
> +packet:  git> CONTENT
> +packet:  git> 
> +packet:  git< status=delayed
> +packet:  git< 
> +
> +
> +If the filter supports the "delay" capability then it must support the
> +"list_available_blobs" command. If Git sends this command, then the
> +filter is expected to return a list of pathnames of blobs that are
> +available. The list must be terminated with a flush packet followed

Is it correct to read the above "pathnames of blobs that are
availble" as "pathnames, among which Git earlier requested to be
filtered with "can-delay=1", for which the filtered result is
ready"?  I do not mean to suggest this awkward wording to replace
yours, but I found yours a bit too fuzzy for first time readers.

Perhaps my brain has rotten by hearing the "delayed/lazy transfer of
large blobs", but it went "Huh?" upon seeing "...are available".

> +by a "success" status that is also terminated with a flush packet. If
> +no blobs for the delayed paths are available, yet, then the filter is
> +expected to block the response until at least one blob becomes
> +available.

Ahh, this is better, at least you use "the delayed paths".

What if the result never gets available (e.g. resulted in an error)?
Or is it considered "we _now_ know the result is an error" and such
a delayed path that failed to retrieve from LFS store "available" at
that point?

It may be too late to raise to a series that went to v6, but this
smells more like "ready" not "available" to me.

> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index a6b2af39d3..c1a256df8d 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -376,6 +376,8 @@ static int checkout_paths(const struct checkout_opts 
> *opts,
>   state.force = 1;
>   state.refresh_cache = 1;
>   state.istate = _index;
> +
> + enable_delayed_checkout();
>   for (pos = 0; pos < active_nr; pos++) {
>   struct cache_entry *ce = active_cache[pos];
>   if (ce->ce_flags & CE_MATCHED) {
> @@ -390,6 +392,7 @@ static int checkout_paths(const struct checkout_opts 
> *opts,
>   pos = skip_same_name(ce, pos) - 1;
>   }
>   }
> + errs |= finish_delayed_checkout();

OK.


>   if (write_locked_index(_index, lock_file, COMMIT_LOCK))
>   die(_("unable to write new index file"));

> diff --git a/cache.h b/cache.h
> index ae4c45d379..2ec12c4477 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1551,8 +1552,11 @@ struct checkout {
>  };
>  #define CHECKOUT_INIT { NULL, "" }
>  
> +
>  #define TEMPORARY_FILENAME_LENGTH 25

You probably do not need that new blank line.

>  extern int checkout_entry(struct cache_entry *ce, const struct checkout 
> *state, char *topath);
> +extern void enable_delayed_checkout(struct checkout *state);
> +extern int finish_delayed_checkout(struct checkout *state);

OK.

> diff --git a/convert.c b/convert.c
> index e55c034d86..6452ab546a 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -533,7 +534,8 @@ static int start_multi_file_filter_fn(struct 
> subprocess_entry 

[PATCH v6 6/6] convert: add "status=delayed" to filter process protocol

2017-06-25 Thread Lars Schneider
Some `clean` / `smudge` filters might require a significant amount of
time to process a single blob (e.g. the Git LFS smudge filter might
perform network requests). During this process the Git checkout
operation is blocked and Git needs to wait until the filter is done to
continue with the checkout.

Teach the filter process protocol (introduced in edcc858) to accept the
status "delayed" as response to a filter request. Upon this response Git
continues with the checkout operation. After the checkout operation Git
calls "finish_delayed_checkout" which queries the filter for remaining
blobs. If the filter is still working on the completion, then the filter
is expected to block. If the filter has completed all remaining blobs
then an empty response is expected.

Git has a multiple code paths that checkout a blob. Support delayed
checkouts only in `clone` (in unpack-trees.c) and `checkout` operations
for now. The optimization is most effective in these code paths as all
files of the tree are processed.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt |  65 +-
 builtin/checkout.c  |   3 +
 cache.h |   4 +
 convert.c   | 115 
 convert.h   |  25 ++
 entry.c | 114 ++--
 t/t0021-conversion.sh   |  74 
 t/t0021/rot13-filter.pl | 189 ++--
 unpack-trees.c  |   2 +
 9 files changed, 501 insertions(+), 90 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 4736483865..5489e8b723 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -425,8 +425,8 @@ packet:  git< capability=clean
 packet:  git< capability=smudge
 packet:  git< 
 
-Supported filter capabilities in version 2 are "clean" and
-"smudge".
+Supported filter capabilities in version 2 are "clean", "smudge",
+and "delay".
 
 Afterwards Git sends a list of "key=value" pairs terminated with
 a flush packet. The list will contain at least the filter command
@@ -512,12 +512,69 @@ the protocol then Git will stop the filter process and 
restart it
 with the next file that needs to be processed. Depending on the
 `filter..required` flag Git will interpret that as error.
 
-After the filter has processed a blob it is expected to wait for
-the next "key=value" list containing a command. Git will close
+After the filter has processed a command it is expected to wait for
+a "key=value" list containing the next command. Git will close
 the command pipe on exit. The filter is expected to detect EOF
 and exit gracefully on its own. Git will wait until the filter
 process has stopped.
 
+Delay
+^
+
+If the filter supports the "delay" capability, then Git can send the
+flag "can-delay" after the filter command and pathname. This flag
+denotes that the filter can delay filtering the current blob (e.g. to
+compensate network latencies) by responding with no content but with
+the status "delayed" and a flush packet.
+
+packet:  git> command=smudge
+packet:  git> pathname=path/testfile.dat
+packet:  git> can-delay=1
+packet:  git> 
+packet:  git> CONTENT
+packet:  git> 
+packet:  git< status=delayed
+packet:  git< 
+
+
+If the filter supports the "delay" capability then it must support the
+"list_available_blobs" command. If Git sends this command, then the
+filter is expected to return a list of pathnames of blobs that are
+available. The list must be terminated with a flush packet followed
+by a "success" status that is also terminated with a flush packet. If
+no blobs for the delayed paths are available, yet, then the filter is
+expected to block the response until at least one blob becomes
+available. The filter can tell Git that it has no more delayed blobs
+by sending an empty list.
+
+packet:  git> command=list_available_blobs
+packet:  git> 
+packet:  git< pathname=path/testfile.dat
+packet:  git< pathname=path/otherfile.dat
+packet:  git< 
+packet:  git< status=success
+packet:  git< 
+
+
+After Git received the pathnames, it will request the corresponding
+blobs again. These requests contain a pathname and an empty content
+section. The filter is expected to respond with the smudged content
+in the usual way as explained above.
+
+packet:  git> command=smudge
+packet:  git> pathname=path/testfile.dat
+packet:  git> 
+packet:  git>   # empty content!
+packet:  git< status=success
+packet:  git< 
+packet:  git< SMUDGED_CONTENT