Re: [PATCH v6 12/13] convert: add filter..process option

2016-09-05 Thread Lars Schneider

> On 30 Aug 2016, at 22:46, Jakub Narębski  wrote:
> 
>>> The filter can exit right after the "error-all". If the filter does
>>> not exit then Git will kill the filter. I'll add this to the docs.
>> 
>> OK.
> 
> Is it explicit kill, or implicit kill: close pipe and end process?

I close the pipe and call "finish_command".


>>> "abort" could be ambiguous because it could be read as "abort only
>>> this file". "abort-all" would work, though. Would you prefer to see
>>> "error" replaced by "abort" and "error-all" by "abort-all"?
>> 
>> No.
>> 
>> I was primarily reacting to "-all" part, so anything that ends with
>> "-all" is equally ugly from my point of view and not an improvement.
>> 
>> As I said, "error-all" as long as other reviewers are happy with is
>> OK by me, too.
> 
> I'm rather partial to "abort" instead of "error-all", or "quit"/"exit"
> (but I prefer "abort" or "bail-out"), as it better reflects what this
> response is about - ending filter process.

After some thinking I agree with "abort" instead of "error-all".


>>> A filter that dies during communication or does not adhere to the protocol
>>> is a faulty filter. Feeding the faulty filter after restart with the same 
>>> blob would likely cause the same error. 
>> 
>> Why does Git restart it and continue with the rest of the blobs
>> then?  Is there a reason to believe it may produce correct results
>> for other blobs if you run it again?
> 
> I guess the idea is that single filter can be run on different types
> of blobs, and it could fail on some types (some files) and not others.
> Like LFS-type filter failing on files with size larger than 2 GiB,
> or iconv-like filter to convert UTF-16 to UTF-8 failing on invalid
> byte sequences.

This mimics the v1 behavior and I will explain that in the documentation.


>>> B) If we communicate "shutdown" to the filter then we need to give the
>>>   filter some time to perform the exit before the filter is killed on
>>>   Git exit. I wasn't able to come up with a good answer how long Git 
>>>   should wait for the exit.
>> 
>> Would that be an issue?  A filter is buggy if told to shutdown,
>> ignores the request and hangs around forever.  I do not think we
>> even need to kill it.  It is not Git's problem.
> 
> I think the idea was for Git to request shutdown, and filter to tell
> when it finished (or just exiting, and Git getting SIGCHLD, I guess).
> It is hard to tell how much to wait, for example for filter process
> to send "sync" command, or finish unmounting, or something...

I like Junios approach because then we don't need to wait at all...


>> I personally would be reluctant to become a filter process writer if
>> the system it will be talking to can kill it without giving it a
>> chance to do a graceful exit, but perhaps that is just me.  I don't
>> know if closing the pipe going there where you are having Git to
>> kill the process on the other end is any more involved than what you
>> have without extra patches.
> 
> Isn't it the same problem with v1 filters being killed on Git process
> exit?  Unless v2 filter wants to do something _after_ writing output
> to Git, it should be safe... unless Git process is killed, but it
> is not different from filter being stray-killed.

Yes, it should be safe. However, I think it is probably nicer if the filter
process can shutdown gracefully instead of being killed.


> +Please note that you cannot use an existing `filter..clean`
> +or `filter..smudge` command with `filter..process`
> +because the former two use a different inter process communication
> +protocol than the latter one.
 
 Would it be a useful sample program we can ship in contrib/ if you
 created a "filter adapter" that reads these two configuration
 variables and act as a filter..process?
>>> 
>>> You mean a v2 filter that would use v1 filters under the hood?
>>> If we would drop v1, then this would be useful. Otherwise I don't
>>> see any need for such a thing.
>> 
>> I meant it as primarily an example people can learn from when they
>> want to write their own.
> 
> Errr... if it were any v1 filter, it would be useless (as bad or
> worse performance than ordinary v1 clean or smudge filter).  It
> might make sense if v1 filter and v2 wrapper were written in the
> same [dynamic] language, and wrapper could just load / require
> filter as a function / procedure, [compile it], and use it.
> For example smudge/clean filter in Perl (e.g. rot13), and v2 wrapper
> in Perl too.

I'll provide a simple Perl rot13 v2 filter in contrib.

Thanks,
Lars

Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-31 Thread Jakub Narębski
Hi,

W dniu 31.08.2016 o 06:57, Torsten Bögershausen pisze:
> On Tue, Aug 30, 2016 at 03:23:10PM -0700, Junio C Hamano wrote:
>> Lars Schneider  writes:
>>> On 30 Aug 2016, at 20:59, Junio C Hamano  wrote:

Torsten, could you please in the future remove irrelevant parts of the
cited email you are responding to?  Thanks in advance,

[...]
 I meant it as primarily an example people can learn from when they
 want to write their own.
>>>
>>> I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
>>> already.
>>
>> I would expect them to peek at contrib/, but do you seriously expect
>> people to comb through t/ directory?
> 
> How about a filter written in C, and placed in ./t/helper/ ?
>
> At least I feel that a filter in C-language could be a starting point
> for others which prefer, depending on the task the filter is going to do,
> to use shell scripts, perl, python or any other high-level language.

People do not look into t/helper/ area, but they do into contrib/.
Git provides examples on how to use its features there (among other
things).  Examples for how use the fast-import feature / protocol
are here, in contrib/fast-import.

I think that C language would be not a good choice, as required
low level wrangling of the protocol would obscure the relevant parts,
... well perhaps except if pkt-line was reused.  High-level language
would be better.  The contrib/fast-import directory includes examples
in shell script, Perl and Python.  The same could be done here
(possibly with more languages).
 
> A test case, where data can not be filtered, would be a minimum.
> As Jakub pointed out, you can use iconv with good and bad data.

The iconv might be good example, but I think it is not a practical
one; are there development environments that do not handle UTF-8?

More practical would be to have for example LFS-ish solution of
storing files as-is in a specified directory.  ROT13 is also good,
if not a very practical example.  Or rezipping all ZIP-compression
based files (ODF, JAR/WAR, DOCX / OOXML, CBZ, EPUB, APK, etc.).

-- 
Jakub Narębski



Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-30 Thread Torsten Bögershausen
On Tue, Aug 30, 2016 at 03:23:10PM -0700, Junio C Hamano wrote:
> Lars Schneider  writes:
> 
> > On 30 Aug 2016, at 20:59, Junio C Hamano  wrote:
> >> 
> >>> "abort" could be ambiguous because it could be read as "abort only
> >>> this file". "abort-all" would work, though. Would you prefer to see
> >>> "error" replaced by "abort" and "error-all" by "abort-all"?
> >> 
> >> No.
> >> 
> >> I was primarily reacting to "-all" part, so anything that ends with
> >> "-all" is equally ugly from my point of view and not an improvement.
> >> 
> >> As I said, "error-all" as long as other reviewers are happy with is
> >> OK by me, too.
> >
> > Now, I see your point. How about "error-and-stop" or "error-stop"
> > instead of "error-all"?
> 
> Not really.  I was primarily reacting to having "error" and
> "error-all", that share the same prefix.  Changing "-all" suffix to
> "-stop" does not make all that difference to me.  But in any case,
> as long as other reviewers are happy with it, it is OK by me, too.
> 
> > Good argument. However, my intention here was to mimic the v1 filter
> > mechanism.
> 
> I am not making any argument and in no way against the chosen
> behaviour.  That "intention here" that was revealed after two
> iterations is something we would want to see as the reasoning
> behind the choice of the final behaviour recorded somewhere,
> and now I drew it out of you, I achieved what I set out to do
> initially ;-)
> 
> > I am not sure I understand your last sentence. Just to be clear:
> > Would you prefer it, if Git would just close the pipe to the filter process
> > on Git exit and leave the filter running?
> 
> As a potential filter writer, I would probably feel it the easiest
> to handle if there is no signal involved.  My job would be to read
> from the command pipe, react to it, and when the command pipe gives
> me EOF, I am expected to exit myself.  That would be simple enough.
> 
> >> I meant it as primarily an example people can learn from when they
> >> want to write their own.
> >
> > I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
> > already.
> 
> I would expect them to peek at contrib/, but do you seriously expect
> people to comb through t/ directory?

How about a filter written in C, and placed in ./t/helper/ ?
At least I feel that a filter in C-language could be a starting point
for others which prefer, depending on the task the filter is going to do,
to use shell scripts, perl, python or any other high-level language.

A test case, where data can not be filtered, would be a minimum.
As Jakub pointed out, you can use iconv with good and bad data.


Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-30 Thread Junio C Hamano
Lars Schneider  writes:

> On 30 Aug 2016, at 20:59, Junio C Hamano  wrote:
>> 
>>> "abort" could be ambiguous because it could be read as "abort only
>>> this file". "abort-all" would work, though. Would you prefer to see
>>> "error" replaced by "abort" and "error-all" by "abort-all"?
>> 
>> No.
>> 
>> I was primarily reacting to "-all" part, so anything that ends with
>> "-all" is equally ugly from my point of view and not an improvement.
>> 
>> As I said, "error-all" as long as other reviewers are happy with is
>> OK by me, too.
>
> Now, I see your point. How about "error-and-stop" or "error-stop"
> instead of "error-all"?

Not really.  I was primarily reacting to having "error" and
"error-all", that share the same prefix.  Changing "-all" suffix to
"-stop" does not make all that difference to me.  But in any case,
as long as other reviewers are happy with it, it is OK by me, too.

> Good argument. However, my intention here was to mimic the v1 filter
> mechanism.

I am not making any argument and in no way against the chosen
behaviour.  That "intention here" that was revealed after two
iterations is something we would want to see as the reasoning
behind the choice of the final behaviour recorded somewhere,
and now I drew it out of you, I achieved what I set out to do
initially ;-)

> I am not sure I understand your last sentence. Just to be clear:
> Would you prefer it, if Git would just close the pipe to the filter process
> on Git exit and leave the filter running?

As a potential filter writer, I would probably feel it the easiest
to handle if there is no signal involved.  My job would be to read
from the command pipe, react to it, and when the command pipe gives
me EOF, I am expected to exit myself.  That would be simple enough.

>> I meant it as primarily an example people can learn from when they
>> want to write their own.
>
> I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
> already.

I would expect them to peek at contrib/, but do you seriously expect
people to comb through t/ directory?


Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-30 Thread Jakub Narębski
Hello,

W dniu 30.08.2016 o 20:59, Junio C Hamano pisze:
> Lars Schneider  writes:
[...]

>> The filter can exit right after the "error-all". If the filter does
>> not exit then Git will kill the filter. I'll add this to the docs.
> 
> OK.

Is it explicit kill, or implicit kill: close pipe and end process?

>> "abort" could be ambiguous because it could be read as "abort only
>> this file". "abort-all" would work, though. Would you prefer to see
>> "error" replaced by "abort" and "error-all" by "abort-all"?
> 
> No.
> 
> I was primarily reacting to "-all" part, so anything that ends with
> "-all" is equally ugly from my point of view and not an improvement.
> 
> As I said, "error-all" as long as other reviewers are happy with is
> OK by me, too.

I'm rather partial to "abort" instead of "error-all", or "quit"/"exit"
(but I prefer "abort" or "bail-out"), as it better reflects what this
response is about - ending filter process.
 
>> A filter that dies during communication or does not adhere to the protocol
>> is a faulty filter. Feeding the faulty filter after restart with the same 
>> blob would likely cause the same error. 
> 
> Why does Git restart it and continue with the rest of the blobs
> then?  Is there a reason to believe it may produce correct results
> for other blobs if you run it again?

I guess the idea is that single filter can be run on different types
of blobs, and it could fail on some types (some files) and not others.
Like LFS-type filter failing on files with size larger than 2 GiB,
or iconv-like filter to convert UTF-16 to UTF-8 failing on invalid
byte sequences.

>> B) If we communicate "shutdown" to the filter then we need to give the
>>filter some time to perform the exit before the filter is killed on
>>Git exit. I wasn't able to come up with a good answer how long Git 
>>should wait for the exit.
> 
> Would that be an issue?  A filter is buggy if told to shutdown,
> ignores the request and hangs around forever.  I do not think we
> even need to kill it.  It is not Git's problem.

I think the idea was for Git to request shutdown, and filter to tell
when it finished (or just exiting, and Git getting SIGCHLD, I guess).
It is hard to tell how much to wait, for example for filter process
to send "sync" command, or finish unmounting, or something...

> I personally would be reluctant to become a filter process writer if
> the system it will be talking to can kill it without giving it a
> chance to do a graceful exit, but perhaps that is just me.  I don't
> know if closing the pipe going there where you are having Git to
> kill the process on the other end is any more involved than what you
> have without extra patches.

Isn't it the same problem with v1 filters being killed on Git process
exit?  Unless v2 filter wants to do something _after_ writing output
to Git, it should be safe... unless Git process is killed, but it
is not different from filter being stray-killed.

 +Please note that you cannot use an existing `filter..clean`
 +or `filter..smudge` command with `filter..process`
 +because the former two use a different inter process communication
 +protocol than the latter one.
>>>
>>> Would it be a useful sample program we can ship in contrib/ if you
>>> created a "filter adapter" that reads these two configuration
>>> variables and act as a filter..process?
>>
>> You mean a v2 filter that would use v1 filters under the hood?
>> If we would drop v1, then this would be useful. Otherwise I don't
>> see any need for such a thing.
> 
> I meant it as primarily an example people can learn from when they
> want to write their own.

Errr... if it were any v1 filter, it would be useless (as bad or
worse performance than ordinary v1 clean or smudge filter).  It
might make sense if v1 filter and v2 wrapper were written in the
same [dynamic] language, and wrapper could just load / require
filter as a function / procedure, [compile it], and use it.
For example smudge/clean filter in Perl (e.g. rot13), and v2 wrapper
in Perl too.

Best,
-- 
Jakub Narębski 
 



Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-30 Thread Lars Schneider

On 30 Aug 2016, at 20:59, Junio C Hamano  wrote:
> 
>> "abort" could be ambiguous because it could be read as "abort only
>> this file". "abort-all" would work, though. Would you prefer to see
>> "error" replaced by "abort" and "error-all" by "abort-all"?
> 
> No.
> 
> I was primarily reacting to "-all" part, so anything that ends with
> "-all" is equally ugly from my point of view and not an improvement.
> 
> As I said, "error-all" as long as other reviewers are happy with is
> OK by me, too.

Now, I see your point. How about "error-and-stop" or "error-stop"
instead of "error-all"?


>> A filter that dies during communication or does not adhere to the protocol
>> is a faulty filter. Feeding the faulty filter after restart with the same 
>> blob would likely cause the same error. 
> 
> Why does Git restart it and continue with the rest of the blobs
> then?  Is there a reason to believe it may produce correct results
> for other blobs if you run it again?

Good argument. However, my intention here was to mimic the v1 filter
mechanism. If a filter fails then Git will run the same v1 filter on the
next file that needs to be filtered. If `filter..required=false`
then a failure is even a legitimate result.


>> B) If we communicate "shutdown" to the filter then we need to give the
>>   filter some time to perform the exit before the filter is killed on
>>   Git exit. I wasn't able to come up with a good answer how long Git 
>>   should wait for the exit.
> 
> Would that be an issue?  A filter is buggy if told to shutdown,
> ignores the request and hangs around forever.  I do not think we
> even need to kill it.  It is not Git's problem.
> I personally would be reluctant to become a filter process writer if
> the system it will be talking to can kill it without giving it a
> chance to do a graceful exit, but perhaps that is just me.  I don't
> know if closing the pipe going there where you are having Git to
> kill the process on the other end is any more involved than what you
> have without extra patches.

I am not sure I understand your last sentence. Just to be clear:
Would you prefer it, if Git would just close the pipe to the filter process
on Git exit and leave the filter running? I think I wouldn't even need
to close the pipe explicitly. The pipe would be closed automatically when
git exits or dies, right? The filter could detect EOF and exit on its own.
That would be OK with me.


 +Please note that you cannot use an existing `filter..clean`
 +or `filter..smudge` command with `filter..process`
 +because the former two use a different inter process communication
 +protocol than the latter one.
>>> 
>>> Would it be a useful sample program we can ship in contrib/ if you
>>> created a "filter adapter" that reads these two configuration
>>> variables and act as a filter..process?
>> 
>> You mean a v2 filter that would use v1 filters under the hood?
>> If we would drop v1, then this would be useful. Otherwise I don't
>> see any need for such a thing.
> 
> I meant it as primarily an example people can learn from when they
> want to write their own.

I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
already.


Thanks,
Lars


Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-30 Thread Junio C Hamano
Lars Schneider  writes:

>> This part of the document is well-written to help filter-writers.
>
> Thanks!

Don't thank me; thank yourself and reviewers of the previous rounds.


> The filter can exit right after the "error-all". If the filter does
> not exit then Git will kill the filter. I'll add this to the docs.

OK.

> "abort" could be ambiguous because it could be read as "abort only
> this file". "abort-all" would work, though. Would you prefer to see
> "error" replaced by "abort" and "error-all" by "abort-all"?

No.

I was primarily reacting to "-all" part, so anything that ends with
"-all" is equally ugly from my point of view and not an improvement.

As I said, "error-all" as long as other reviewers are happy with is
OK by me, too.

> A filter that dies during communication or does not adhere to the protocol
> is a faulty filter. Feeding the faulty filter after restart with the same 
> blob would likely cause the same error. 

Why does Git restart it and continue with the rest of the blobs
then?  Is there a reason to believe it may produce correct results
for other blobs if you run it again?

> B) If we communicate "shutdown" to the filter then we need to give the
>filter some time to perform the exit before the filter is killed on
>Git exit. I wasn't able to come up with a good answer how long Git 
>should wait for the exit.

Would that be an issue?  A filter is buggy if told to shutdown,
ignores the request and hangs around forever.  I do not think we
even need to kill it.  It is not Git's problem.

I personally would be reluctant to become a filter process writer if
the system it will be talking to can kill it without giving it a
chance to do a graceful exit, but perhaps that is just me.  I don't
know if closing the pipe going there where you are having Git to
kill the process on the other end is any more involved than what you
have without extra patches.

>>> +Please note that you cannot use an existing `filter..clean`
>>> +or `filter..smudge` command with `filter..process`
>>> +because the former two use a different inter process communication
>>> +protocol than the latter one.
>> 
>> Would it be a useful sample program we can ship in contrib/ if you
>> created a "filter adapter" that reads these two configuration
>> variables and act as a filter..process?
>
> You mean a v2 filter that would use v1 filters under the hood?
> If we would drop v1, then this would be useful. Otherwise I don't
> see any need for such a thing.

I meant it as primarily an example people can learn from when they
want to write their own.


Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-30 Thread Lars Schneider

> On 30 Aug 2016, at 00:21, Junio C Hamano  wrote:
> 
> larsxschnei...@gmail.com writes:
> 
>> +In case the filter cannot or does not want to process the content,
>> +it is expected to respond with an "error" status. Depending on the
>> +`filter..required` flag Git will interpret that as error
>> +but it will not stop or restart the filter process.
>> +
>> +packet:  git< status=error\n
>> +packet:  git< 
>> +
>> +
>> +In case the filter cannot or does not want to process the content
>> +as well as any future content for the lifetime of the Git process,
>> +it is expected to respond with an "error-all" status. Depending on
>> +the `filter..required` flag Git will interpret that as error
>> +but it will not stop or restart the filter process.
>> +
>> +packet:  git< status=error-all\n
>> +packet:  git< 
>> +
> 
> This part of the document is well-written to help filter-writers.

Thanks!


> One thing that was unclear from the above to me, when read as a
> potential filter-writer, is when I am supposed to exit(2).  After I
> tell Git with error-all (I would have called it "abort", but that's
> OK) that I desire no further communication, am I free to go?  Or do
> I wait until Git somehow disconnects (perhaps by closing the packet
> stream I have been reading)?

The filter can exit right after the "error-all". If the filter does
not exit then Git will kill the filter. I'll add this to the docs.

"abort" could be ambiguous because it could be read as "abort only
this file". "abort-all" would work, though. Would you prefer to see
"error" replaced by "abort" and "error-all" by "abort-all"?


>> +If the filter dies during the communication or does not adhere to
>> +the protocol then Git will stop the filter process and restart it
>> +with the next file that needs to be processed.
> 
> Hmph, is there a reason not to retry a half-converted-and-failed
> blob with the fresh process?  Note that this is not "you must do it
> that way", and it is not even "I think doing so may be a better
> idea".  I merely want to know the reason behind this decision.

A filter that dies during communication or does not adhere to the protocol
is a faulty filter. Feeding the faulty filter after restart with the same 
blob would likely cause the same error. 

There are legitimate reasons for retries. E.g. if the filter communicates
with the network. In these cases I expect the filter to handle the retry
logic. Git just writes to and reads from pipes. I don't expect frequent
problems in that area. Plus the existing filter mechanism has no retry
either.

Later on we could easily add a "retry" capability if we deem it necessary,
though.


>> +After the filter has processed a blob it is expected to wait for
>> +the next "key=value" list containing a command. When the Git process
>> +terminates, it will send a kill signal to the filter in that stage.
> 
> The "kill" may not be very nice.  As Git side _knows_ that the
> filter is waiting for the next command, having an explicit
> "shutdown" command would give the filter a chance to implement a
> clean exit--it may have some housekeeping tasks it wants to perform
> once it is done.  The "explicit shutdown" could just be "the pipe
> gets closed", so from the implementation point of view there may not
> be anything you need to further add to this patch (after all, when
> we exit, the pipes to them would be closed), but the shutdown
> protocol and the expectation on the behaviour of filter processes
> would need to be documented.

I implemented a shutdown command in v4 [1][2] but dropped it in v5 after
a discussion with Peff [3].

[1] http://public-inbox.org/git/20160803164225.46355-8-larsxschnei...@gmail.com/
[2] 
http://public-inbox.org/git/20160803164225.46355-13-larsxschnei...@gmail.com/
[3] 
http://public-inbox.org/git/20160803225313.pk3tfe5ovz4y3...@sigill.intra.peff.net/

My main reasons to drop it:

A) There is no central place in Git that could execute code *after*
   all filter operations are complete and *before* Git exits. Therefore,
   I had to add a "clean_on_exit_handler()" to "run-command" [1]. This
   change made this series even larger and therefore harder to review.

B) If we communicate "shutdown" to the filter then we need to give the
   filter some time to perform the exit before the filter is killed on
   Git exit. I wasn't able to come up with a good answer how long Git 
   should wait for the exit.

Do you think I should resurrect the "shutdown" patch?


>> +If a `filter..clean` or `filter..smudge` command
>> +is configured then these commands always take precedence over
>> +a configured `filter..process` command.
> 
> It may make more sense to give precedence to the .process (which is
> a late-comer) if defined, ignoring .clean and .smudge, than the
> other way around.

I agree.


>> +Please note that you cannot use an existing `filter..cle

Re: [PATCH v6 12/13] convert: add filter..process option

2016-08-29 Thread Junio C Hamano
larsxschnei...@gmail.com writes:

> +In case the filter cannot or does not want to process the content,
> +it is expected to respond with an "error" status. Depending on the
> +`filter..required` flag Git will interpret that as error
> +but it will not stop or restart the filter process.
> +
> +packet:  git< status=error\n
> +packet:  git< 
> +
> +
> +In case the filter cannot or does not want to process the content
> +as well as any future content for the lifetime of the Git process,
> +it is expected to respond with an "error-all" status. Depending on
> +the `filter..required` flag Git will interpret that as error
> +but it will not stop or restart the filter process.
> +
> +packet:  git< status=error-all\n
> +packet:  git< 
> +

This part of the document is well-written to help filter-writers.

One thing that was unclear from the above to me, when read as a
potential filter-writer, is when I am supposed to exit(2).  After I
tell Git with error-all (I would have called it "abort", but that's
OK) that I desire no further communication, am I free to go?  Or do
I wait until Git somehow disconnects (perhaps by closing the packet
stream I have been reading)?

> +If the filter dies during the communication or does not adhere to
> +the protocol then Git will stop the filter process and restart it
> +with the next file that needs to be processed.

Hmph, is there a reason not to retry a half-converted-and-failed
blob with the fresh process?  Note that this is not "you must do it
that way", and it is not even "I think doing so may be a better
idea".  I merely want to know the reason behind this decision.

> +After the filter has processed a blob it is expected to wait for
> +the next "key=value" list containing a command. When the Git process
> +terminates, it will send a kill signal to the filter in that stage.

The "kill" may not be very nice.  As Git side _knows_ that the
filter is waiting for the next command, having an explicit
"shutdown" command would give the filter a chance to implement a
clean exit--it may have some housekeeping tasks it wants to perform
once it is done.  The "explicit shutdown" could just be "the pipe
gets closed", so from the implementation point of view there may not
be anything you need to further add to this patch (after all, when
we exit, the pipes to them would be closed), but the shutdown
protocol and the expectation on the behaviour of filter processes
would need to be documented.

> +If a `filter..clean` or `filter..smudge` command
> +is configured then these commands always take precedence over
> +a configured `filter..process` command.

It may make more sense to give precedence to the .process (which is
a late-comer) if defined, ignoring .clean and .smudge, than the
other way around.

> +Please note that you cannot use an existing `filter..clean`
> +or `filter..smudge` command with `filter..process`
> +because the former two use a different inter process communication
> +protocol than the latter one.

Would it be a useful sample program we can ship in contrib/ if you
created a "filter adapter" that reads these two configuration
variables and act as a filter..process?

During an imaginary session of "git add .", I think I found where
you start THE filter process upon the first path that needs to be
filtered with one for the configured , and I think the same
place is where you reuse THE filter process, but I am not sure where
you are cleaning up by killing the filter once all paths are added.
Wouldn't you need some hooks at strategic places after such bulk
operation to tell the multi-file-filter machinery to walk all the
entries in cmd_process_map and tell the remaining filter processes
that they have no more tasks, or something?  Are you relying on
these processes to exit upon a read failure after we exit and the
pipe going to the filter is severed?

Thanks.


[PATCH v6 12/13] convert: add filter..process option

2016-08-25 Thread larsxschneider
From: Lars Schneider 

Git's clean/smudge mechanism invokes an external filter process for
every single blob that is affected by a filter. If Git filters a lot of
blobs then the startup time of the external filter processes can become
a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge
filter written in golang to filter 12,000 files. This process took 364s
with the existing filter mechanism and 5s with the new mechanism. See
details here: https://github.com/github/git-lfs/pull/1382

This patch adds the `filter..process` string option which, if
used, keeps the external filter process running and processes all blobs
with the packet format (pkt-line) based protocol over standard input and
standard output. The full protocol is explained in detail in
`Documentation/gitattributes.txt`.

A few key decisions:

* The long running filter process is referred to as filter protocol
  version 2 because the existing single shot filter invocation is
  considered version 1.
* Git sends a welcome message and expects a response right after the
  external filter process has started. This ensures that Git will not
  hang if a version 1 filter is incorrectly used with the
  filter..process option for version 2 filters. In addition,
  Git can detect this kind of error and warn the user.
* The status of a filter operation (e.g. "success" or "error) is set
  before the actual response and (if necessary!) re-set after the
  response. The advantage of this two step status response is that if
  the filter detects an error early, then the filter can communicate
  this and Git does not even need to create structures to read the
  response.
* All status responses are pkt-line lists terminated with a flush
  packet. This allows us to send other status fields with the same
  protocol in the future.

Helped-by: Martin-Louis Bright 
Reviewed-by: Jakub Narebski 
Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt | 146 +++-
 convert.c   | 349 +
 pkt-line.h  |   1 +
 t/t0021-conversion.sh   | 372 
 t/t0021/rot13-filter.pl | 176 +++
 unpack-trees.c  |   1 +
 6 files changed, 1015 insertions(+), 30 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8882a3e..6346700 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -300,7 +300,13 @@ checkout, when the `smudge` command is specified, the 
command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate.  If a long running `process` filter is used
+in place of `clean` and/or `smudge` filters, then Git can process
+all blobs with a single filter command invocation for the entire
+life of a single Git command, for example `git add --all`.  See
+section below for the description of the protocol used to
+communicate with a `process` filter.
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -375,6 +381,144 @@ substitution.  For example:
 
 
 
+Long Running Filter Process
+^^^
+
+If the filter command (a string value) is defined via
+`filter..process` then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using the following packet format
+(pkt-line, see technical/protocol-common.txt) based protocol over
+standard input and standard output.
+
+Git starts the filter when it encounters the first file
+that needs to be cleaned or smudged. After the filter started
+Git sends a welcome message ("git-filter-client"), a list of
+supported protocol version numbers, and a flush packet. Git expects
+to read a welcome response message ("git-filter-server") and exactly
+one protocol version number from the previously sent list. All further
+communication will be based on the selected version. The remaining
+protocol description below documents "version=2". Please note that
+"version=42" in the example below does not exist and is only there
+to illustrate how the protocol would look like with more than one
+version.
+
+After the version negotiation Git sends a list of supported capabilities
+and a flush packet. Git expects to read a list of desired capabilities,
+which must be a subset of the supported capabilities list, and a flush
+packet as response:
+
+packet:  git> git-filter-client
+packet:  git> version=2
+packet:  git> versio