Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-10-06 Thread Lars Schneider

> On 04 Oct 2016, at 23:00, Jakub Narębski  wrote:
> 
> [Some of the answers and comments may have been invalidated by v9]
> 
> On 30.09.2016 at 21:38, Lars Schneider wrote:
>>> On 27 Sep 2016, at 17:37, Jakub Narębski  wrote:
>>> 
>>> Part second of the review of 11/11.
> [...]
 +
 +  if (start_command(process)) {
 +  error("cannot fork to run external filter '%s'", cmd);
 +  kill_multi_file_filter(hashmap, entry);
 +  return NULL;
 +  }
>>> 
>>> I guess there is a reason why we init hashmap entry, try to start
>>> external process, then kill entry if unable to start, instead of
>>> trying to start external process, and adding hashmap entry when
>>> we succeed?
>> 
>> Yes. This way I can reuse the kill_multi_file_filter() function.
> 
> I don't quite understand.  If you didn't fill the entry before
> using start_command(process), you would not need kill_multi_file_filter(),
> which in that case IIUC just removes the just created entry from hashmap.
> Couldn't you add entry to hashmap in the 'else' part?  Or would it
> be racy?

You are right. I'll fix that.


>> 
 +  if (pair[0] && pair[0]->len && pair[1]) {
 +  if (!strcmp(pair[0]->buf, "status=")) {
 +  strbuf_reset(status);
 +  strbuf_addbuf(status, pair[1]);
 +  }
>>> 
>>> So it is last status= line wins behavior?
>> 
>> Correct.
> 
> Perhaps this should be described in code comment.

OK
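
To spell out the behaviour such a comment would document: whichever
"status=" line comes last simply overwrites the earlier value.  A toy
sketch of that "last one wins" rule (illustration only, not the C code
from the patch):

  use strict;
  use warnings;

  # "Last one wins": each status= line simply overwrites the previous value.
  my %meta;
  for my $line ( "status=error", "status=success" ) {
      my ( $k, $v ) = split /=/, $line, 2;
      $meta{$k} = $v;
  }
  print "$meta{status}\n";    # prints "success" - the last status= line won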


 
 +  fflush(NULL);
>>> 
>>> Why this fflush(NULL) is needed here?
>> 
>> This flushes all open output streams. The single filter does the same.
> 
> I know what it does, but I don't know why.  But "single filter does it"
> is good enough for me.  Still would want to know why, though ;-)

TBH I am not 100% sure why, either. I think this ensures that we don't have 
any outdated/unrelated/previous data in the stream buffers.


 
 +  if (fd >= 0 && !src) {
 +  if (fstat(fd, &file_stat) == -1)
 +  return 0;
 +  len = xsize_t(file_stat.st_size);
 +  }
>>> 
>>> Errr... is it necessary?  The protocol no longer provides size=
>>> hint, and neither uses such hint if provided.
>> 
>> We require the size in write_packetized_from_buf() later.
> 
> Don't we use write_packetized_from_fd() in the case of fd >= 0?

Of course! Ah too many refactorings :-)
I'll remove that.

Thank you,
Lars



Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-10-06 Thread Lars Schneider

> On 04 Oct 2016, at 22:50, Jakub Narębski  wrote:
> 
> [Some of the answers may get invalidated by v9]
> 
> On 30.09.2016 at 20:56, Lars Schneider wrote:
>>> On 27 Sep 2016, at 00:41, Jakub Narębski  wrote:
>>> 
 +
 +After the filter has processed a blob it is expected to wait for
 +the next "key=value" list containing a command. Git will close
 +the command pipe on exit. The filter is expected to detect EOF
 +and exit gracefully on its own.
> 
> Is this still true?

Yes


>>> 
>>> Good to have it documented.  
>>> 
>>> Anyway, as it is Git command that spawns the filter driver process,
>>> assuming that the filter process doesn't daemonize itself, wouldn't
>>> the operating system reap it after its parent process, that is the
>>> git command it invoked, dies? So detecting EOF is good, but not
>>> strictly necessary for simple filter that do not need to free
>>> its resources, or can leave freeing resources to the operating
>>> system? But I may be wrong here.
>> 
>> The filter process runs independent of Git.
> 
> Ah.  So without some way to tell long-lived filter process that
> it can shut down, because no further data will be incoming, or
> killing it by Git, it would hang indefinitely?

Yes

- Lars
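
For illustration of that point, a minimal sketch of a long-running
filter's main loop that detects EOF on the command pipe (Git closed it)
and exits on its own instead of hanging; this is a reduced stand-in,
not the contrib example:

  #!/usr/bin/perl
  #
  # Minimal sketch: detect EOF on the command pipe and shut down cleanly.
  use strict;
  use warnings;

  sub read_packet {
      my $n = read STDIN, my $len_hex, 4;
      return undef unless defined $n && $n == 4;  # EOF: Git went away
      my $len = hex($len_hex);
      return "" if $len == 0;                     # "0000" flush packet
      read STDIN, my $payload, $len - 4;
      return $payload;
  }

  while (1) {
      my $request = read_packet();
      last unless defined $request;               # EOF -> exit gracefully
      # ... handle "command=clean\n" / "command=smudge\n" requests here ...
  }
  # free any resources and exit; nothing further will arrive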

Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-10-04 Thread Jakub Narębski
[Some of the answers and comments may have been invalidated by v9]

On 01.10.2016 at 17:34, Lars Schneider wrote:
>> On 29 Sep 2016, at 01:14, Jakub Narębski  wrote:
>>
>> Part third (and last) of the review of v8 11/11.

[...]
>>> +
>>> +test_expect_success PERL 'required process filter should filter data' '
>>> +   test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
>>> +   test_config_global filter.protocol.required true &&
>>> +   rm -rf repo &&
>>> +   mkdir repo &&
>>> +   (
>>> +   cd repo &&
>>> +   git init &&
>>
>> Don't you think that creating a fresh test repository for each
>> separate test is a bit too much?  I guess that you want for
>> each and every test to be completely independent, but this setup
>> and teardown is a bit excessive.
>>
>> Other tests in the same file (should we reuse the test, or use
>> new test file) do not use this method.
> 
> I see your point. However, I am always annoyed if Git tests are
> entangled because it makes working with them way way harder.
This test runs in 4.5s on a slow Travis CI machine. I think
> that is OK considering that we have tests running 3.5min (t3404).

All right, this is a good argument... though inconsistency (assuming
that we don't switch to a separate test for the multi-file filter) could
be an argument against.

Though I wonder if test preparation could be extracted into a
common function...

[...]
>>> +   cp ../test.o test.r &&
>>> +   cp ../test2.o test2.r &&
>>
>> What does this test2.o / test2.r file tests, that test.o / test.r
>> doesn't?  The name doesn't tell us.
> 
> This just tests multiple files with different content.

All right, so it is here to test multiple files (and that filter
actually process multiple files).

>> Why it is test.r, but test2.r?  Why it isn't test1.r?
> 
> test.r already existed (created in setup test).

With each test in separate repository we could copy test.r prepared
in 'setup' into test1.r in each of multi-file tests.

[...]
>>> +   >test4-empty.r &&
>>
>> You test ordinary file, file in subdirectory, file with filename
>> containing spaces, and an empty file.
>>
>> Other tests of single file `clean`/`smudge` filters use filename
>> that requires mangling; maybe we should use similar file?
>>
>>special="name  with '\''sq'\'' and \$x" &&
>>echo some test text >"$special" &&
> 
> OK.

This test was required because of using %f to pass the filename as a
parameter, coupled with the fact that we use `clean` and `smudge` as
shell script fragments (so e.g. pipes and redirection would work in
a one-shot filter definition).

This is not the case with multi-file filter, where filenames are
passed internally, and we don't need to worry about shell quoting
at all.
 
>> In case of `process` filter, a special filename could look like
>> this:
>>
>>process_special="name=with equals and\nembedded newlines\n" &&
>>echo some test text >"$process_special" &&
> 
> I think this test would create trouble on Windows. I'll stick to
> the special characters used in the single shot filter.

This would test... example parser.  Well, all right, better not
give problems for Windows.

But... you can create such files in Git Bash:

  $ touch "$(echo -n -e "fo=o\nbar\n")"

and though they look strange

  $ ls -1 fo*
  'fo=o'$'\n''bar'

but work all right

  $ echo "strangest" >>"$(echo -n -e "fo=o\nbar\n")"
  $ name="$(echo -n -e "fo=o\nbar\n")"
  $ cat "$name"
  strangest

>> Third, why the filter even writes output size? It is no longer
>> part of `process` filter driver protocol, and it makes test more
>> fragile.
> 
> I would prefer to leave that in. I think it is good for the test to
> check that we are transmitting the amount of content that we
> think we transmit.

Right, we test that we processed the full file this way, in the
multi-packet test. 

>>> +   <<-\EOF &&
>>> +   START
>>> +   wrote filter header
>>> +   STOP
>>> +   EOF
>>
>> Why is even filter process invoked?  If this is not expected, perhaps
>> simply ignore what checking out almost empty branch (one without any
>> files marked for filtering) does.
>>
>> Shouldn't we test_expect_failure no-call?
> 
> Because a clean operation could happen. I added a clean operation to
> the expected log in order to make this visible (expected log is stripped
> of clean operations in the same way as the actual log per your suggestion
> above).

If we are testing that if there is no "smudge" capability, then
there were no "smudge" operations, why don't we test just that:
that grepping for "smudge" in the log doesn't find anything.

Current version feels convoluted (and would stop working if Git
is improved to not run "clean" in this case for optimization).
 
>>> +
>>> +   check_filter_ignore_clean \
>>> +  

Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-10-04 Thread Jakub Narębski
[Some of the answers and comments may have been invalidated by v9]

On 30.09.2016 at 21:38, Lars Schneider wrote:
>> On 27 Sep 2016, at 17:37, Jakub Narębski  wrote:
>>
>> Part second of the review of 11/11.
[...]
>>> +
>>> +   if (start_command(process)) {
>>> +   error("cannot fork to run external filter '%s'", cmd);
>>> +   kill_multi_file_filter(hashmap, entry);
>>> +   return NULL;
>>> +   }
>>
>> I guess there is a reason why we init hashmap entry, try to start
>> external process, then kill entry if unable to start, instead of
>> trying to start external process, and adding hashmap entry when
>> we succeed?
> 
> Yes. This way I can reuse the kill_multi_file_filter() function.

I don't quite understand.  If you didn't fill the entry before
using start_command(process), you would not need kill_multi_file_filter(),
which in that case IIUC just removes the just created entry from hashmap.
Couldn't you add entry to hashmap in the 'else' part?  Or would it
be racy?

[...]
>>> +static void read_multi_file_filter_values(int fd, struct strbuf *status) {
>>
>> This is more
>>
>>  +static void read_multi_file_filter_status(int fd, struct strbuf *status) {
>>
>> It doesn't read arbitrary values, it examines 'metadata' from
>> filter for "status=" lines.
> 
> True!
>
>>> +   if (pair[0] && pair[0]->len && pair[1]) {
>>> +   if (!strcmp(pair[0]->buf, "status=")) {
>>> +   strbuf_reset(status);
>>> +   strbuf_addbuf(status, pair[1]);
>>> +   }
>>
>> So it is last status= line wins behavior?
> 
> Correct.

Perhaps this should be described in code comment.
 
>>>
>>> +   fflush(NULL);
>>
>> Why this fflush(NULL) is needed here?
> 
> This flushes all open output streams. The single filter does the same.

I know what it does, but I don't know why.  But "single filter does it"
is good enough for me.  Still would want to know why, though ;-)
 
>>>
>>> +   if (fd >= 0 && !src) {
>>> +   if (fstat(fd, &file_stat) == -1)
>>> +   return 0;
>>> +   len = xsize_t(file_stat.st_size);
>>> +   }
>>
>> Errr... is it necessary?  The protocol no longer provides size=
>> hint, and neither uses such hint if provided.
> 
> We require the size in write_packetized_from_buf() later.

Don't we use write_packetized_from_fd() in the case of fd >= 0?

[...]

Best,
-- 
Jakub Narębski



Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-10-04 Thread Jakub Narębski
[Some of the answers may get invalidated by v9]

On 30.09.2016 at 20:56, Lars Schneider wrote:
>> On 27 Sep 2016, at 00:41, Jakub Narębski  wrote:
>>
>> Part first of the review of 11/11.

[...]
>>> +to read a welcome response message ("git-filter-server") and exactly
>>> +one protocol version number from the previously sent list. All further
>>
>> I guess that is to provide forward-compatibility, isn't it?  Also,
>> "Git expects..." probably means filter process MUST send, in the
>> RFC2119 (https://tools.ietf.org/html/rfc2119) meaning.
> 
> True. I feel "expects" reads better but I am happy to change it if
> you feel strong about it.

I don't have a strong opinion about this.  I guess following the example
of the pkt-line format description may be a good idea.  We are not writing
an RFC here... but being explicit is better than being a good read :-P

>>> +
>>> +After the version negotiation Git sends a list of supported capabilities
>>> +and a flush packet.
>>
>> Is it that Git SHOULD send list of ALL supported capabilities, or is
>> it that Git SHOULD NOT send capabilities it does not support, and that
>> it MAY send only those capabilities it needs (so for example if command
>> uses only `smudge`, it may not send `clean`, so that filter driver doesn't
>> need to initialize data it would not need).
> 
> "After the version negotiation Git sends a list of all capabilities that
> it supports and a flush packet."
> 
> Better?

Better.

I wonder if it is a matter of the current implementation, namely that at
the point of code where Git is sending capabilities it doesn't know
which of them it will be using; at least in some cases.

Because if it were possible for Git to not send capabilities which
it supports, but is sure it would not be using for a given operation,
then it would be good to do that.  It might be that the filter driver
must do some prep work for each of the capabilities, so skipping some
of the prep would make git faster.  Though all this is for now
theoretical musings... it might be an argument for describing the
protocol in a way that does not prevent Git from sending only the
supported capabilities it needs.

>> I wonder why it is "<capability>=true", and not "capability=<name>".
>> Is there a case where we would want to send "<capability>=false"?  Or
>> is it to allow configurable / value based capabilities?  Isn't it going
>> a bit too far: is there even a hint of an idea for parametrize-able
>> capability? YAGNI is a thing...
> 
> Peff suggested that format and I think it is OK:
> http://public-inbox.org/git/20160803224619.bwtbvmslhuicx...@sigill.intra.peff.net/

It wouldn't kill you to summarize the argument, would it?

From what I understand the argument is that "<capability>=true" allows
for the simplest parsing into an [intermediate] hash, while the other
proposal of using "capability=<name>" would require something more
sophisticated.  And that it is better to be explicit rather than
use synonyms / shortcuts for "<capability>".

Though one can argue that "<capability>" is a synonym / shortcut for
"<capability>=true"; it would not complicate parsing much.

Nb. we don't use the 'parse metadata to hash' trick in either the example
or the filter driver used in the tests...
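
For what it's worth, the 'parse metadata to hash' trick that the
"<capability>=true" form enables is roughly this (a standalone sketch
with hard-coded input, not code from the example or the tests):

  use strict;
  use warnings;

  # The @lines below stand in for the metadata packets read from Git;
  # every "key=value" line splits on the first "=" into a hash entry.
  my @lines = ( "clean=true", "smudge=true", "not-yet-invented=true" );
  my %caps  = map { my ( $k, $v ) = split /=/, $_, 2; ( $k => $v ) } @lines;

  print "clean supported\n"  if ( $caps{clean}  // "" ) eq "true";
  print "smudge supported\n" if ( $caps{smudge} // "" ) eq "true";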

[...]
>>> +
>>> +After the filter has processed a blob it is expected to wait for
>>> +the next "key=value" list containing a command. Git will close
>>> +the command pipe on exit. The filter is expected to detect EOF
>>> +and exit gracefully on its own.

Is this still true?

>>
>> Good to have it documented.  
>>
>> Anyway, as it is Git command that spawns the filter driver process,
>> assuming that the filter process doesn't daemonize itself, wouldn't
>> the operating system reap it after its parent process, that is the
>> git command it invoked, dies? So detecting EOF is good, but not
>> strictly necessary for simple filter that do not need to free
>> its resources, or can leave freeing resources to the operating
>> system? But I may be wrong here.
> 
> The filter process runs independent of Git.

Ah.  So without some way to tell long-lived filter process that
it can shut down, because no further data will be incoming, or
killing it by Git, it would hang indefinitely?

[...]
>>> +}
>>> +elsif ( $pkt_size > 4 ) {
>>
>> Isn't a packet of $pkt_size == 4 a valid packet, a keep-alive
>> one?  Or is it forbidden?
> 
> "Implementations SHOULD NOT send an empty pkt-line ("0004")."
> Source: Documentation/technical/protocol-common.txt

All right.  Not that we need a keep-alive packet for communication
between two processes on the same host.

Though I wonder why this rule is here...

[...]
>>> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
>>> +( packet_txt_read() eq ( 0, "version=2" ) ) || die "bad version";
>>> +( packet_bin_read() eq ( 1, "" ) )  || die "bad version end";
>>
>> Actually, it is overly strict.  It should not fail if there
>> are other "version=3", "version=4" etc. lines.
> 
> True, but I think for an example this is OK. I'll add a note
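
Such a note could, for instance, point at a more forgiving check.  A
sketch of the idea (the @handshake list stands in for what the pkt-line
read helpers would return, so this is only an illustration):

  use strict;
  use warnings;

  # Accept the handshake as long as "version=2" appears somewhere in the
  # list; ignore any "version=3", "version=4", ... a future Git may offer.
  my @handshake = ( "git-filter-client", "version=2", "version=42" );

  $handshake[0] eq "git-filter-client"
      or die "bad initialize: '$handshake[0]'";
  grep { $_ eq "version=2" } @handshake
      or die "bad version: version=2 not offered";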

Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-10-01 Thread Lars Schneider

> On 29 Sep 2016, at 01:14, Jakub Narębski  wrote:
> 
> Part third (and last) of the review of v8 11/11.
> 
> On 20.09.2016 at 21:02, larsxschnei...@gmail.com wrote:
> 
> 
>> @@ -31,7 +31,10 @@ test_expect_success setup '
>>  cat test >test.i &&
>>  git add test test.t test.i &&
>>  rm -f test test.t test.i &&
>> -git checkout -- test test.t test.i
>> +git checkout -- test test.t test.i &&
>> +
>> +echo "content-test2" >test2.o &&
>> +echo "content-test3 - subdir" >"test3 - subdir.o"
> 
> I see that you prepare here a few uncommitted files, but both
> their names and their contents leave much to be desired - you
> don't know from the name and contents what they are for.
> 
> And the '"subdir"' file which is not in subdirectory is
> especially egregious.

These are 3 files with somewhat random test content. I renamed
"subdir" to "spaces".


>> +check_filter () {
>> +rm -f rot13-filter.log actual.log &&
>> +"$@" 2> git_stderr.log &&
>> +test_must_be_empty git_stderr.log &&
>> +cat >expected.log &&
> 
> This is too clever by half.  Having a function that both tests
> the behavior and prepares 'expected' file is too much.
> 
> In my opinion preparation of 'expected.log' file should be moved
> to another function or functions.
> 
> Also, if we are running sort on output, I think we should also
> run sort on 'expected.log', so that what we write doesn't need to
> be created sorted (so we don't have to sort expected lines by hand).
> Or maybe we should run the same transformation on rot13-filter.log
> and on the contents of expected.log.

Agreed. Very good suggestion!


>> +check_filter_ignore_clean () {
>> +rm -f rot13-filter.log actual.log &&
>> +"$@" &&
> 
> Why we don't check for stderr here?

Because this function is used by "git checkout" which writes all
kinds of stuff to stderr. I added "--quiet --no-progress" to
disable this behavior. 


>> +check_rot13 () {
>> +test_cmp "$1" "$2" &&
>> +./../rot13.sh <"$1" >expected &&
> 
> Why there is .. in this invocation?

Because this script is located in the root of the current test directory.


>> +git cat-file blob :"$2" >actual &&
>> +test_cmp expected actual
>> +}
>> +
>> +test_expect_success PERL 'required process filter should filter data' '
>> +test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
>> +test_config_global filter.protocol.required true &&
>> +rm -rf repo &&
>> +mkdir repo &&
>> +(
>> +cd repo &&
>> +git init &&
> 
> Don't you think that creating a fresh test repository for each
> separate test is a bit too much?  I guess that you want for
> each and every test to be completely independent, but this setup
> and teardown is a bit excessive.
> 
> Other tests in the same file (should we reuse the test, or use
> new test file) do not use this method.

I see your point. However, I am always annoyed if Git tests are
entangled because it makes working with them way way harder.
This test runs in 4.5s on a slow Travis CI machine. I think
that is OK considering that we have tests running 3.5min (t3404).


>> +echo "*.r filter=protocol" >.gitattributes &&
>> +git add . &&
>> +git commit . -m "test commit" &&
>> +git branch empty &&
> 
> Err... I think it would be better to name it 'empty-branch'
> (or 'almost-empty-branch', as it does include .gitattributes file).
> See my mistake below (marked ...).

"empty-branch". OK


>> +
>> +cp ../test.o test.r &&
>> +cp ../test2.o test2.r &&
> 
> What does this test2.o / test2.r file tests, that test.o / test.r
> doesn't?  The name doesn't tell us.

This just tests multiple files with different content.


> Why it is test.r, but test2.r?  Why it isn't test1.r?

test.r already existed (created in setup test).


>> +mkdir testsubdir &&
>> +cp "../test3 - subdir.o" "testsubdir/test3 - subdir.r" &&
> 
> Why it needs to have different contents?

To check that the filter does the right thing with multiple files
and contents.



>> +>test4-empty.r &&
> 
> You test ordinary file, file in subdirectory, file with filename
> containing spaces, and an empty file.
> 
> Other tests of single file `clean`/`smudge` filters use filename
> that requires mangling; maybe we should use similar file?
> 
>special="name  with '\''sq'\'' and \$x" &&
>echo some test text >"$special" &&

OK.


> In case of `process` filter, a special filename could look like
> this:
> 
>process_special="name=with equals and\nembedded newlines\n" &&
>echo some test text >"$process_special" &&

I think this test would create trouble on Windows. I'll stick to
the special characters used in the single shot filter.


>> +<<-\EOF &&
>> +1 IN: clean test.r 57 [OK] -- OUT: 57 . 

Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-09-30 Thread Lars Schneider

> On 27 Sep 2016, at 17:37, Jakub Narębski  wrote:
> 
> Part second of the review of 11/11.
> 
> On 20.09.2016 at 21:02, larsxschnei...@gmail.com wrote:
> 
>> +
>> +if (!drv->process && (CAP_CLEAN & wanted_capability) && drv->clean)
> 
> This is just a very minor nitpicking, but wouldn't it be easier
> to read with those checks reordered?
> 
>  +if ((wanted_capability & CAP_CLEAN) && !drv->process && drv->clean)

OK


>> +
>> +if (start_command(process)) {
>> +error("cannot fork to run external filter '%s'", cmd);
>> +kill_multi_file_filter(hashmap, entry);
>> +return NULL;
>> +}
> 
> I guess there is a reason why we init hashmap entry, try to start
> external process, then kill entry if unable to start, instead of
> trying to start external process, and adding hashmap entry when
> we succeed?

Yes. This way I can reuse the kill_multi_file_filter() function.


>> +
>> +sigchain_push(SIGPIPE, SIG_IGN);
> 
> I guess that this is here to handle errors writing to filter
> by ourself, isn't it?

Yes.


>> +error("external filter '%s' does not support long running filter protocol", cmd);
> 
> We could have described the error here better.
> 
>  +error("external filter '%s' does not support filter protocol version 2", cmd);

OK


>> +static void read_multi_file_filter_values(int fd, struct strbuf *status) {
> 
> This is more
> 
>  +static void read_multi_file_filter_status(int fd, struct strbuf *status) {
> 
> It doesn't read arbitrary values, it examines 'metadata' from
> filter for "status=" lines.

True!


>> +if (pair[0] && pair[0]->len && pair[1]) {
>> +if (!strcmp(pair[0]->buf, "status=")) {
>> +strbuf_reset(status);
>> +strbuf_addbuf(status, pair[1]);
>> +}
> 
> So it is last status= line wins behavior?

Correct.


> 
>> +}
> 
> Shouldn't we free 'struct strbuf **pair', maybe allocated by the
> strbuf_split_str() function, and reset to NULL?

True. strbuf_list_free() should be enough.


>> 
>> +fflush(NULL);
> 
> Why this fflush(NULL) is needed here?

This flushes all open output streams. The single filter does the same.


>> 
>> +if (fd >= 0 && !src) {
>> +if (fstat(fd, &file_stat) == -1)
>> +return 0;
>> +len = xsize_t(file_stat.st_size);
>> +}
> 
> Errr... is it necessary?  The protocol no longer provides size=
> hint, and neither uses such hint if provided.

We require the size in write_packetized_from_buf() later.


>> +
>> +err = strlen(filter_type) > PKTLINE_DATA_MAXLEN;
>> +if (err)
>> +goto done;
> 
> Errr... this should never happen.  We control which capabilities
> we pass, it can be only "clean" or "smudge", nothing else. Those
> would always be shorter than PKTLINE_DATA_MAXLEN.
> 
> Never mind that that is "command=smudge\n" etc. that needs to
> be shorter that PKTLINE_DATA_MAXLEN!
> 
> So, IMHO it should be at most assert, and needs to be corrected
> anyway.

OK!


> This should never happen, PATH_MAX everywhere is much shorter
> than PKTLINE_DATA_MAXLEN / LARGE_PACKET_MAX.  Or is it?
> 
> Anyway, we should probably explain or warn
> 
>   error("path name too long: '%s'", path);

OK


>> +/*
>> + * Something went wrong with the protocol filter.
>> + * Force shutdown and restart if another blob requires filtering!
> 
> Is this exclamation mark '!' here necessary?
> 

No.


Thanks,
Lars



Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-09-30 Thread Lars Schneider

> On 27 Sep 2016, at 00:41, Jakub Narębski  wrote:
> 
> Part first of the review of 11/11.
> 
> On 20.09.2016 at 21:02, larsxschnei...@gmail.com wrote:
>> From: Lars Schneider 
>> 
>> diff --git a/Documentation/gitattributes.txt 
>> b/Documentation/gitattributes.txt
>> index 7aff940..946dcad 100644
>> --- a/Documentation/gitattributes.txt
>> +++ b/Documentation/gitattributes.txt
>> @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the 
>> command is
>> fed the blob object from its standard input, and its standard
>> output is used to update the worktree file.  Similarly, the
>> `clean` command is used to convert the contents of worktree file
>> -upon checkin.
>> +upon checkin. By default these commands process only a single
>> +blob and terminate.  If a long running `process` filter is used
>   
> 
> Should we use this terminology here?  I have not read the preceding
> part of documentation, so I don't know if it talks about "blobs" or
> if it uses "files" and/or "file contents".

I used that because it was used in the paragraph above already.


>> +Long Running Filter Process
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +If the filter command (a string value) is defined via
>> +`filter.<driver>.process` then Git can process all blobs with a
>> +single filter invocation for the entire life of a single Git
>> +command. This is achieved by using a packet format (pkt-line,
>> +see technical/protocol-common.txt) based protocol over standard
>> +input and standard output as follows. All packets are considered
>> +text and therefore are terminated by an LF. Exceptions are the
>> +"*CONTENT" packets and the flush packet.
> 
> I guess that reasoning here is that all but CONTENT packets are
> metadata, and thus to aid debuggability of the protocol are "text",
> as considered by pkt-line.
> 
> Perhaps a bit more readable would be the following (but current is
> just fine; I am nitpicking):
> 
>  All packets, except for the "{star}CONTENT" packets and the ""
>  flush packet, are considered text and therefore are terminated by
>  an LF.

OK, I use that!


> I think it might be a good idea to describe what flush packet is
> somewhere in this document; on the other hand referring (especially
> if hyperlinked) to pkt-line technical documentation might be good
> enough / better.  I'm unsure, but I tend on the side that referring
> to technical documentation is better.

I have this line in the first paragraph of the Long Running Filter process:
"packet format (pkt-line, see technical/protocol-common.txt) based protocol"

> 
>> +to read a welcome response message ("git-filter-server") and exactly
>> +one protocol version number from the previously sent list. All further
> 
> I guess that is to provide forward-compatibility, isn't it?  Also,
> "Git expects..." probably means filter process MUST send, in the
> RFC2119 (https://tools.ietf.org/html/rfc2119) meaning.

True. I feel "expects" reads better but I am happy to change it if
you feel strong about it.


>> +
>> +After the version negotiation Git sends a list of supported capabilities
>> +and a flush packet.
> 
> Is it that Git SHOULD send list of ALL supported capabilities, or is
> it that Git SHOULD NOT send capabilities it does not support, and that
> it MAY send only those capabilities it needs (so for example if command
> uses only `smudge`, it may not send `clean`, so that filter driver doesn't
> need to initialize data it would not need).

"After the version negotiation Git sends a list of all capabilities that
it supports and a flush packet."

Better?


> I wonder why it is "<capability>=true", and not "capability=<name>".
> Is there a case where we would want to send "<capability>=false"?  Or
> is it to allow configurable / value based capabilities?  Isn't it going
> a bit too far: is there even a hint of an idea for parametrize-able
> capability? YAGNI is a thing...

Peff suggested that format and I think it is OK:
http://public-inbox.org/git/20160803224619.bwtbvmslhuicx...@sigill.intra.peff.net/


> A few new capabilities that we might want to support in the near future
> is "size", "stream", which are options describing how to communicate,
> and "cleanFromFile", "smudgeToFile", which are new types of operations...
> but neither needs any parameter.
> 
> I guess that adding new capabilities doesn't require having to come up
> with the new version of the protocol, isn't it.

Correct.


>> +packet:  git< git-filter-server
>> +packet:  git< version=2
>> +packet:  git> clean=true
>> +packet:  git> smudge=true
>> +packet:  git> not-yet-invented=true
> 
> Hmmm... should we hint at the use of kebab-case versus snake_case
> or camelCase for new capabilities?

I personally prefer kebab-case but I think that is a discussion for
future contributions ;-)


>> +
>> +packet:  git> command=smudge
>> +packet:  git> pathname=path/testfile.dat
>> +packet:  

Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-09-28 Thread Jakub Narębski
Part third (and last) of the review of v8 11/11.

On 20.09.2016 at 21:02, larsxschnei...@gmail.com wrote:
[...]
> diff --git a/contrib/long-running-filter/example.pl 
> b/contrib/long-running-filter/example.pl
> new file mode 100755
> index 000..c13a631
> --- /dev/null
> +++ b/contrib/long-running-filter/example.pl
[...]
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index dc50938..210c4f6 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh

One thing that could have been done as yet another preparatory
patch would be to modernize existing t/t0021-conversion.sh tests.
For example use here-doc instead of series of echo-s, use cp
to copy files and not echo, etc.

> @@ -31,7 +31,10 @@ test_expect_success setup '
>   cat test >test.i &&
>   git add test test.t test.i &&
>   rm -f test test.t test.i &&
> - git checkout -- test test.t test.i
> + git checkout -- test test.t test.i &&
> +
> + echo "content-test2" >test2.o &&
> + echo "content-test3 - subdir" >"test3 - subdir.o"

I see that you prepare here a few uncommitted files, but both
their names and their contents leave much to be desired - you
don't know from the name and contents what they are for.

And the '"subdir"' file which is not in subdirectory is
especially egregious.

>  '
>  
>  script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
> @@ -279,4 +282,364 @@ test_expect_success 'diff does not reuse worktree files 
> that need cleaning' '
>   test_line_count = 0 count
>  '
>  

A small comment on parameters of this function would be nice.

> +check_filter () {
> + rm -f rot13-filter.log actual.log &&
> + "$@" 2> git_stderr.log &&
> + test_must_be_empty git_stderr.log &&
> + cat >expected.log &&

This is too clever by half.  Having a function that both tests
the behavior and prepares 'expected' file is too much.

In my opinion preparation of 'expected.log' file should be moved
to another function or functions.

Also, if we are running sort on output, I think we should also
run sort on 'expected.log', so that what we write doesn't need to
be created sorted (so we don't have to sort expected lines by hand).
Or maybe we should run the same transformation on rot13-filter.log
and on the contents of expected.log.

> + sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >actual.log &&
> + test_cmp expected.log actual.log
> +}
> +
> +check_filter_count_clean () {
> + rm -f rot13-filter.log actual.log &&
> + "$@" 2> git_stderr.log &&
> + test_must_be_empty git_stderr.log &&

All those functions (well, wait?) have common setup, which we can
extract into a separate shell function, I think.  IMVHO.

> + cat >expected.log &&
> + sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
> + sed "s/^\([0-9]\) IN: clean/x IN: clean/" >actual.log &&
> + test_cmp expected.log actual.log
> +}
> +
> +check_filter_ignore_clean () {
> + rm -f rot13-filter.log actual.log &&
> + "$@" &&

Why we don't check for stderr here?

> + cat >expected.log &&
> + grep -v "IN: clean" rot13-filter.log >actual.log &&
> + test_cmp expected.log actual.log
> +}
> +
> +check_filter_no_call () {
> + rm -f rot13-filter.log &&
> + "$@" 2> git_stderr.log &&
> + test_must_be_empty git_stderr.log &&
> + test_must_be_empty rot13-filter.log
> +}
> +

A small comment on parameters of this function would be nice.
And a comment what it does.

> +check_rot13 () {
> + test_cmp "$1" "$2" &&
> + ./../rot13.sh <"$1" >expected &&

Why there is .. in this invocation?

> + git cat-file blob :"$2" >actual &&
> + test_cmp expected actual
> +}
> +
> +test_expect_success PERL 'required process filter should filter data' '
> + test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
> + test_config_global filter.protocol.required true &&
> + rm -rf repo &&
> + mkdir repo &&
> + (
> + cd repo &&
> + git init &&

Don't you think that creating a fresh test repository for each
separate test is a bit too much?  I guess that you want for
each and every test to be completely independent, but this setup
and teardown is a bit excessive.

Other tests in the same file (should we reuse the test, or use
new test file) do not use this method.

> +
> + echo "*.r filter=protocol" >.gitattributes &&
> + git add . &&
> + git commit . -m "test commit" &&
> + git branch empty &&

Err... I think it would be better to name it 'empty-branch'
(or 'almost-empty-branch', as it does include .gitattributes file).
See my mistake below (marked ...).

> +
> + cp ../test.o test.r &&
> + cp ../test2.o test2.r &&

What does this test2.o / test2.r file tests, that test.o / test.r
doesn't?  The name doesn't tell us.

Why it is test.r, but test2.r?  Why it isn't test1.r?

> + mkdir testsubdir &&
> + cp "../test3 

Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-09-27 Thread Jakub Narębski
Part second of the review of 11/11.

On 20.09.2016 at 21:02, larsxschnei...@gmail.com wrote:

> diff --git a/contrib/long-running-filter/example.pl 
> b/contrib/long-running-filter/example.pl
> new file mode 100755
> index 000..c13a631
> --- /dev/null
> +++ b/contrib/long-running-filter/example.pl
[...]
> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
> +( packet_txt_read() eq ( 0, "version=2" ) ) || die "bad version";
> +( packet_bin_read() eq ( 1, "" ) )  || die "bad version end";

What I would like to see here is some kind of packet_read_list()
or packet_txt_read_list() that reads until flush packet or EOF,
and returns list of chomp-ed lines (without LF terminator).

Then you can examine those lines:

   my @lines = packet_read_list();
   $lines[0] eq "git-filter-client"  or die "bad initialization: '$lines[0]'";
   grep { $_ eq "version=2" } @lines or die "bad version: version=2 not found";

Note: I have not checked that I got operator precedence right.
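
One possible shape for such a helper, sketched with a minimal stand-in
for packet_txt_read() (the real reading code lives in the example and
test filters, so treat the details below as assumptions):

  use strict;
  use warnings;

  # Stand-in reader: returns ( 1, "" ) on a flush packet, otherwise
  # ( 0, $line ) with the trailing LF stripped.
  sub packet_txt_read {
      read STDIN, my $len_hex, 4 or die "unexpected EOF";
      my $len = hex($len_hex);
      return ( 1, "" ) if $len == 0;              # "0000" flush packet
      read STDIN, my $payload, $len - 4;
      chomp $payload;
      return ( 0, $payload );
  }

  # Proposed helper: read text packets up to the next flush packet and
  # return them as a list of chomp-ed lines.
  sub packet_txt_read_list {
      my @lines;
      while (1) {
          my ( $is_flush, $line ) = packet_txt_read();
          last if $is_flush;
          push @lines, $line;
      }
      return @lines;
  }

  my @lines = packet_txt_read_list();
  $lines[0] eq "git-filter-client" or die "bad initialization: '$lines[0]'";
  grep { $_ eq "version=2" } @lines or die "bad version: version=2 not found";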

> +
> +packet_txt_write("git-filter-server");
> +packet_txt_write("version=2");

Here using packet_write_list() would help us to not forget to
send the flush packet:

  packet_write_list(
"git-filter-server",
"version=2"
  );
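
A sketch of what packet_write_list() and the helpers it builds on might
look like; the names mirror the example filter, but the bodies here are
only illustrative:

  use strict;
  use warnings;
  use IO::Handle;

  # Write one LF-terminated text packet: 4 hex digits of total length
  # (including the 4 length bytes themselves), then the payload.
  sub packet_txt_write {
      my ($line) = @_;
      my $payload = $line . "\n";
      printf STDOUT "%04x%s", 4 + length($payload), $payload;
  }

  sub packet_flush {
      print STDOUT "0000";
  }

  # The point of the helper: callers cannot forget the trailing flush.
  sub packet_write_list {
      packet_txt_write($_) for @_;
      packet_flush();
      STDOUT->flush();
  }

  packet_write_list( "git-filter-server", "version=2" );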

[...]
> diff --git a/convert.c b/convert.c
> index 597f561..bd66257 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -3,6 +3,7 @@
>  #include "run-command.h"
>  #include "quote.h"
>  #include "sigchain.h"
> +#include "pkt-line.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -442,7 +443,7 @@ static int filter_buffer_or_fd(int in, int out, void 
> *data)
>   return (write_err || status);
>  }
>  
> -static int apply_filter(const char *path, const char *src, size_t len, int fd,
> +static int apply_single_file_filter(const char *path, const char *src, size_t len, int fd,
>  struct strbuf *dst, const char *cmd)
>  {
>   /*
> @@ -456,12 +457,6 @@ static int apply_filter(const char *path, const char 
> *src, size_t len, int fd,
>   struct async async;
>   struct filter_params params;
>  
> - if (!cmd || !*cmd)
> - return 0;
> -
> - if (!dst)
> - return 1;
> -
>   memset(&async, 0, sizeof(async));
>   async.proc = filter_buffer_or_fd;
>   async.data = &params;

I have reordered a few chunks of this patch to make it easier to
see what happens here, and to review this part of patch.

[moved here from further in the patch]
> +static int apply_filter(const char *path, const char *src, size_t len,
> +int fd, struct strbuf *dst, struct convert_driver *drv,
> +const unsigned int wanted_capability)
> +{
> + const char *cmd = NULL;
> +
> + if (!drv)
> + return 0;
> +
> + if (!dst)
> + return 1;

To reduce the size of this patch (which is not yet of the size that
would make vger reject the email :-/), perhaps this split into
apply_single_file_filter() and apply_filter(), without yet adding
apply_multi_file_filter(), could go into a separate preparatory
commit. The apply_filter() would be a semi-simple wrapper, with the
same signature as above.

> +
> + if (!drv->process && (CAP_CLEAN & wanted_capability) && drv->clean)

This is just a very minor nitpicking, but wouldn't it be easier
to read with those checks reordered?

  + if ((wanted_capability & CAP_CLEAN) && !drv->process && drv->clean)

That is: if we want `clean` capability, and `process` is not set,
and there is `clean` filter.

Though the following would also work...

  + if ((wanted_capability & CAP_CLEAN) && !drv->process)

> + cmd = drv->clean;
> + else if (!drv->process && (CAP_SMUDGE & wanted_capability) && drv->smudge)
> + cmd = drv->smudge;
> +
> + if (cmd && *cmd)

... thanks to this check (the 'cmd' part, which we need to check anyway).

'cmd = drv->clean', then 'if (cmd)' is the same as 'if (drv->clean)',
then 'cmd = drv->clean', then 'if (cmd)', isn't it.

Not sure if it would be more readable, or less readable.

CAP_CLEAN and CAP_SMUDGE are, in theory, mutually exclusive.  Note that
the above order prefers `smudge` to `clean` if both given, while in other
places we prefer `clean` to `smudge` if both given.


> + return apply_single_file_filter(path, src, len, fd, dst, cmd);
> + else if (drv->process && *drv->process)
> + return apply_multi_file_filter(path, src, len, fd, dst, drv->process, wanted_capability);
> +
> + return 0;
> +}

Nice and clean wrapper.

> +

[moved here from further in the patch]
> @@ -839,7 +1140,7 @@ int would_convert_to_git_filter_fd(const char *path)
>   if (!ca.drv->required)
>   return 0;
>  
> - return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
> + return apply_filter(path, NULL, 0, -1, NULL, ca.drv, CAP_CLEAN);
>  }
>  
>  const char *get_convert_attr_ascii(const char *path)

This would 

Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

2016-09-26 Thread Jakub Narębski
Part first of the review of 11/11.

On 20.09.2016 at 21:02, larsxschnei...@gmail.com wrote:
> From: Lars Schneider 
> 
> Git's clean/smudge mechanism invokes an external filter process for
> every single blob that is affected by a filter. If Git filters a lot of
> blobs then the startup time of the external filter processes can become
> a significant part of the overall Git execution time.
> 
> In a preliminary performance test this developer used a clean/smudge
> filter written in golang to filter 12,000 files. This process took 364s
> with the existing filter mechanism and 5s with the new mechanism. See
> details here: https://github.com/github/git-lfs/pull/1382
> 
> This patch adds the `filter.<driver>.process` string option which, if
> used, keeps the external filter process running and processes all blobs
> with the packet format (pkt-line) based protocol over standard input and
> standard output. The full protocol is explained in detail in
> `Documentation/gitattributes.txt`.

That is a good description.  Enough detail to explain the new feature,
all without duplicating information with (added) docs.

> 
> A few key decisions:
> 
> * The long running filter process is referred to as filter protocol
>   version 2 because the existing single shot filter invocation is
>   considered version 1.

All right.

> * Git sends a welcome message and expects a response right after the
>   external filter process has started. This ensures that Git will not
>   hang if a version 1 filter is incorrectly used with the
>   filter.<driver>.process option for version 2 filters. In addition,
>   Git can detect this kind of error and warn the user.

On the one hand, this involved handshake means that implementing
a filter process script is harder; you need to write quite a lot of
boilerplate (though the example or examples would help).

On the other hand, this handshake is what allows good error detection,
easy extendability of the protocol, and forward-compatibility.  Which,
as we agreed (AFAIU), is more important.

> * The status of a filter operation (e.g. "success" or "error") is set
>   before the actual response and (if necessary!) re-set after the
>   response. The advantage of this two step status response is that if
>   the filter detects an error early, then the filter can communicate
>   this and Git does not even need to create structures to read the
>   response.

That's nice (well, among others I have argued for this :-))

> * All status responses are pkt-line lists terminated with a flush
>   packet. This allows us to send other status fields with the same
>   protocol in the future.

Good.

This also makes protocol simple, easier to implement (on Git side),
and easier to parse (on filter side).
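
On one reading of those two bullets, the filter side of a successful
smudge would look roughly like the following; the helpers are
throw-away definitions for this illustration, not the contrib example:

  use strict;
  use warnings;

  # Throw-away pkt-line writers, just for this sketch.
  sub packet_txt_write { printf "%04x%s\n", 5 + length( $_[0] ), $_[0] }
  sub packet_flush     { print "0000" }

  my $smudged = "filtered content\n";              # stands in for the real result

  packet_txt_write("status=success");              # early status list
  packet_flush();
  printf "%04x%s", 4 + length($smudged), $smudged; # content packet(s)
  packet_flush();                                  # end of content
  packet_flush();                                  # empty list: keep "status=success"
  # a late failure would instead send "status=error" in that final list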

> 
> Helped-by: Martin-Louis Bright 
> Reviewed-by: Jakub Narebski 
> Signed-off-by: Lars Schneider 
> ---
>  Documentation/gitattributes.txt| 156 +-
>  contrib/long-running-filter/example.pl | 123 +++
>  convert.c  | 348 ---
>  pkt-line.h |   1 +
>  t/t0021-conversion.sh  | 365 -
>  t/t0021/rot13-filter.pl| 191 +
>  6 files changed, 1153 insertions(+), 31 deletions(-)

That's quite a large change.  Large changes are harder to review.
I was thinking about how one could split this change.  I guess
that it is better to keep the new feature, its documentation, and
its tests together.  But perhaps the example in `contrib/`
(which is newer, and thus less reviewed) would be better in
separate commit.

There is also another change that could be split off this patch
into purely preparatory commit, that is one that stands alone
but doesn't make much sense alone.  I would write about this
proposal (important only if there would be yet another iteration
of this patch series) a bit later.

>  create mode 100755 contrib/long-running-filter/example.pl
>  create mode 100755 t/t0021/rot13-filter.pl
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 7aff940..946dcad 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the 
> command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate.  If a long running `process` filter is used
   

Should we use this terminology here?  I have not read the preceding
part of documentation, so I don't know if it talks about "blobs" or
if it uses "files" and/or "file contents".

Though this is very minor nitpick.

> +in place of `clean` 

[PATCH v8 11/11] convert: add filter.<driver>.process option

2016-09-20 Thread larsxschneider
From: Lars Schneider 

Git's clean/smudge mechanism invokes an external filter process for
every single blob that is affected by a filter. If Git filters a lot of
blobs then the startup time of the external filter processes can become
a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge
filter written in golang to filter 12,000 files. This process took 364s
with the existing filter mechanism and 5s with the new mechanism. See
details here: https://github.com/github/git-lfs/pull/1382

This patch adds the `filter.<driver>.process` string option which, if
used, keeps the external filter process running and processes all blobs
with the packet format (pkt-line) based protocol over standard input and
standard output. The full protocol is explained in detail in
`Documentation/gitattributes.txt`.

A few key decisions:

* The long running filter process is referred to as filter protocol
  version 2 because the existing single shot filter invocation is
  considered version 1.
* Git sends a welcome message and expects a response right after the
  external filter process has started. This ensures that Git will not
  hang if a version 1 filter is incorrectly used with the
  filter.<driver>.process option for version 2 filters. In addition,
  Git can detect this kind of error and warn the user.
* The status of a filter operation (e.g. "success" or "error") is set
  before the actual response and (if necessary!) re-set after the
  response. The advantage of this two step status response is that if
  the filter detects an error early, then the filter can communicate
  this and Git does not even need to create structures to read the
  response.
* All status responses are pkt-line lists terminated with a flush
  packet. This allows us to send other status fields with the same
  protocol in the future.

Helped-by: Martin-Louis Bright 
Reviewed-by: Jakub Narebski 
Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt| 156 +-
 contrib/long-running-filter/example.pl | 123 +++
 convert.c  | 348 ---
 pkt-line.h |   1 +
 t/t0021-conversion.sh  | 365 -
 t/t0021/rot13-filter.pl| 191 +
 6 files changed, 1153 insertions(+), 31 deletions(-)
 create mode 100755 contrib/long-running-filter/example.pl
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 7aff940..946dcad 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the 
command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate.  If a long running `process` filter is used
+in place of `clean` and/or `smudge` filters, then Git can process
+all blobs with a single filter command invocation for the entire
+life of a single Git command, for example `git add --all`.  See
+section below for the description of the protocol used to
+communicate with a `process` filter.
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -373,6 +379,154 @@ not exist, or may have different contents. So, smudge and 
clean commands
 should not try to access the file on disk, but only act as filters on the
 content provided to them on standard input.
 
+Long Running Filter Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the filter command (a string value) is defined via
+`filter.<driver>.process` then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using a packet format (pkt-line,
+see technical/protocol-common.txt) based protocol over standard
+input and standard output as follows. All packets are considered
+text and therefore are terminated by an LF. Exceptions are the
+"*CONTENT" packets and the flush packet.
+
+Git starts the filter when it encounters the first file
+that needs to be cleaned or smudged. After the filter started
+Git sends a welcome message ("git-filter-client"), a list of
+supported protocol version numbers, and a flush packet. Git expects
+to read a welcome response message ("git-filter-server") and exactly
+one protocol version number from the previously sent list. All further
+communication will be based on the selected version. The remaining
+protocol description below documents "version=2". Please note that
+"version=42" in the example below does not exist and is only