Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-05-21 Thread Lars Schneider

> On 19 Apr 2017, at 20:55, Torsten Bögershausen  wrote:
> 
> 
>>> (Back to the roots)
>>> Which criteria do you have in mind: When should a filter process the blob
>>> and return it immediately, and when would it respond "delayed" ?
>> 
>> See above: it's up to the filter. In case of Git LFS: delay if a network 
>> call is required.
>> 
> That makes sense.
> I'm trying to understand the big picture and, from there, review
> the details.
> Does it make sense to mention "git lfs" in the commit message,
> and/or add some test code ?

I'll mention Git LFS in the commit message. The test code
(t/t0021/rot13-filter.pl) should already mimic the behavior of Git LFS.

Thanks,
Lars

Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-19 Thread Torsten Bögershausen

>> (Back to the roots)
>> Which criteria do you have in mind: When should a filter process the blob
>> and return it immediately, and when would it respond "delayed" ?
> 
> See above: it's up to the filter. In case of Git LFS: delay if a network call 
> is required.
> 
That makes sense.
I'm trying to understand the big picture and, from there, review
the details.
Does it make sense to mention "git lfs" in the commit message,
and/or add some test code ?




Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-18 Thread Taylor Blau
On Tue, Apr 18, 2017 at 06:14:36PM +0200, Lars Schneider wrote:
> > Both Git and the filter are going to have to keep these paths in memory
> > somewhere, be that in-process, or on disk. That being said, I can see
> > potential troubles with a large number of long paths that exceed the
> > memory available to Git or the filter when stored in a hashmap/set.
> >
> > On Git's side, I think trading that for some CPU time might make sense.
> > If Git were to SHA1 each path and store that in a hashmap, it would
> > consume more CPU time, but less memory to store each path. Git and the
> > filter could then exchange path names, and Git would simply SHA1 the
> > pathname each time it needed to refer back to memory associated with
> > that entry in a hashmap.
>
> I would be surprised if this were necessary. If the filter delays 50,000
> files (= a lot!) with a path length of 1,000 characters (= very long!),
> then we would use 50,000 x 1,000 bytes = 50MB plus some hashmap data
> structures. Modern machines should have enough RAM, I would think...

I agree, and thanks for correcting my thinking here. I ran a simple command to
get the longest path names in a large repository, as:

  $ find . -type f | awk '{ print length($0) }' | sort -r -n | uniq -c

and found a few files close to the 200-character mark as the longest pathnames
in the repository. I think 50k files at 1k bytes per pathname leaves quite
enough headroom :-).


--
Thanks,
Taylor Blau


Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-18 Thread Lars Schneider

> On 12. Apr 2017, at 19:46, Taylor Blau  wrote:
> 
> I think this is a great approach and one that I'd be happy to implement in
> LFS. The additional capability isn't too complex, so I think other similar
> filters to LFS shouldn't have a hard time implementing it either.
> 
> I left a few comments, mostly expressing approval of the documentation
> changes. I'll leave the C review to someone more expert than me.
> 
> +1 from me on the protocol changes.

Thanks!


>> +Delay
>> +^^^^^
>> +
>> +If the filter supports the "delay" capability, then Git can send the
>> +flag "delay-able" after the filter command and pathname.
> 
> Nit: I think either way is fine, but `can_delay` will save us 1 byte for
> each new checkout entry.

1 byte is not a convincing argument to me, but since you are a native speaker
I trust your suggestion. I prefer dashes over underscores, though, for
consistency with the rest of the protocol: "can-delay".


>> +"delay-id", a number that identifies the blob, and a flush packet. The
>> +filter acknowledges this number with a "success" status and a flush
>> +packet.
> 
> I mentioned this in another thread, but I'd prefer, if possible, that we
> use the pathname as a unique identifier for referring back to a particular
> checkout entry. I think adding an additional identifier adds unnecessary
> complication to the protocol and introduces a forced mapping on the filter
> side from id to path.

I agree! I answered in the other thread. Let's keep the discussion there.


> Both Git and the filter are going to have to keep these paths in memory
> somewhere, be that in-process, or on disk. That being said, I can see
> potential troubles with a large number of long paths that exceed the
> memory available to Git or the filter when stored in a hashmap/set.
> 
> On Git's side, I think trading that for some CPU time might make sense.
> If Git were to SHA1 each path and store that in a hashmap, it would
> consume more CPU time, but less memory to store each path. Git and the
> filter could then exchange path names, and Git would simply SHA1 the
> pathname each time it needed to refer back to memory associated with
> that entry in a hashmap.

I would be surprised if this were necessary. If the filter delays 50,000
files (= a lot!) with a path length of 1,000 characters (= very long!),
then we would use 50,000 x 1,000 bytes = 50MB plus some hashmap data
structures. Modern machines should have enough RAM, I would think...

Thanks,
Lars

Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-18 Thread Lars Schneider

> On 12. Apr 2017, at 06:37, Torsten Bögershausen  wrote:
> 
> On 2017-04-11 21:50, Lars Schneider wrote:
> 
> []
>>> packet:  git> command=smudge
>>> packet:  git> pathname=path/testfile.dat
>>> packet:  git> delay-id=1
>>> packet:  git> 0000
>>> packet:  git> CONTENT
>>> packet:  git> 0000
>>> packet:  git< status=delayed # this means: Git, please feed more
>>> packet:  git> 0000
>> Actually, this is how I implemented it first.
>> 
>> However, I didn't like that because we associate a
>> pathname with a delay-id. If the filter does not
>> delay the item then we associate a different
>> pathname with the same delay-id in the next request.
>> Therefore I think it is better to present the delay-id
>> *only* to the filter if the item is actually delayed.
>> 
>> I would be surprised if the extra round trip impacted
>> the performance in any meaningful way.
>> 
>> 
> 
> Two spontaneous remarks:
> 
> - Git can simply use a counter which is incremented for each blob
>  that is sent to the filter.
>  Regardless of what the filter answers (delayed or not), simply increment a
>  counter. (or is this too simple and am I missing something?)

I am not sure I understand what you mean. Do you mean that the filter just
numbers delayed items itself with an increasing counter (first item = 0,
second item = 1, ...)? That could work, but I would prefer a more explicitly
defined solution. An implicit, self-counting scheme could easily get out of
sync, I think.


> - I was thinking that the filter code is written as either "never delay" or
>  "always delay".
>  "Never delay" is the existing code.
>  What is your idea: when should a filter respond with delayed?
>  My thinking was "always", silently assuming that more than one core can be
>  used, so that blobs can be filtered in parallel.

In case of Git LFS I expect the filter to answer with "delay" if a network
call is required to fulfill the (smudge) filter request.


>> We could do this but I think this would only complicate
>> the protocol. I expect the filter to spool results to
>> disk or something.
>  Spooling things to disk was not part of my picture, to be honest.
>  This means additional execution time when an SSD is used, and the chips
>  wear out faster...
>  There may be situations where this is OK for some users (but not for others).
>  How can we prevent Git from (over-)flooding the filter?

I don't think this is Git's responsibility. If the filter can't handle the data 
then I expect the filter to *not* ask Git for a delay.


>  The protocol response from the filter would be just "delayed", and the filter
>  would block Git, right?
>  But, in any case, it would still be nice if Git would collect converted blobs
>  from the filter, to free resources here.
>  This is more like the "streaming model", but on a higher level:
>  send 4 blobs to the filter, collect the ready ones, send the 5th blob to
>  the filter, collect the ready ones, send the 6th blob to the filter, collect
>  the ready ones...

That sounds nice, but I don't think that it is necessary in this (first) 
iteration. I think it is acceptable for the filter to spool data to disk.


> (Back to the roots)
> Which criteria do you have in mind: When should a filter process the blob
> and return it immediately, and when would it respond "delayed" ?

See above: it's up to the filter. In case of Git LFS: delay if a network call 
is required.

Thanks,
Lars

Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-12 Thread Taylor Blau
I think this is a great approach and one that I'd be happy to implement in LFS.
The additional capability isn't too complex, so I think other similar filters to
LFS shouldn't have a hard time implementing it either.

I left a few comments, mostly expressing approval of the documentation changes.
I'll leave the C review to someone more expert than me.

+1 from me on the protocol changes.

> +Delay
> +^^^^^
> +
> +If the filter supports the "delay" capability, then Git can send the
> +flag "delay-able" after the filter command and pathname.

Nit: I think either way is fine, but `can_delay` will save us 1 byte for each
new checkout entry.

> +"delay-id", a number that identifies the blob, and a flush packet. The
> +filter acknowledges this number with a "success" status and a flush
> +packet.

I mentioned this in another thread, but I'd prefer, if possible, that we use the
pathname as a unique identifier for referring back to a particular checkout
entry. I think adding an additional identifier adds unnecessary complication to
the protocol and introduces a forced mapping on the filter side from id to
path.

Both Git and the filter are going to have to keep these paths in memory
somewhere, be that in-process, or on disk. That being said, I can see potential
troubles with a large number of long paths that exceed the memory available to
Git or the filter when stored in a hashmap/set.

On Git's side, I think trading that for some CPU time might make sense. If Git
were to SHA1 each path and store that in a hashmap, it would consume more CPU
time, but less memory to store each path. Git and the filter could then exchange
path names, and Git would simply SHA1 the pathname each time it needed to refer
back to memory associated with that entry in a hashmap.
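
Sketching that idea in Go for brevity (Git itself would do it in C with its
hashmap; the "entry" struct is made up): hashing each pathname with SHA1
gives a fixed 20-byte key, so the per-entry cost of the table no longer
depends on the path length.

package main

import (
    "crypto/sha1"
    "fmt"
)

// entry is whatever per-path state would need to be kept around
// while a checkout entry is delayed (hypothetical field).
type entry struct {
    delayed bool
}

func main() {
    // A [20]byte array is a valid map key in Go, so the SHA1 digest
    // itself can index the table: fixed 20 bytes per path instead
    // of the full pathname.
    table := make(map[[20]byte]*entry)

    path := "some/very/long/pathname/in/the/repository.dat"
    table[sha1.Sum([]byte(path))] = &entry{delayed: true}

    // Later, to refer back to the same pathname, re-hash it and
    // look up the associated state.
    if e, ok := table[sha1.Sum([]byte(path))]; ok {
        fmt.Println("found entry, delayed =", e.delayed)
    }
}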

> +by a "success" status that is also terminated with a flush packet. If
> +no blobs for the delayed paths are available, yet, then the filter is
> +expected to block the response until at least one blob becomes
> +available. The filter can tell Git that it has no more delayed blobs
> +by sending an empty list.

I think the blocking nature of the `list_available_blobs` command is a great
synchronization mechanism for telling the filter when it can and can't send new
blob information to Git. I was tentatively thinking of suggesting to remove this
command and instead allow the filter to send readied blobs in sequence after all
unique checkout entries had been sent from Git to the filter. But I think this
approach allows us more flexibility, and isn't that much extra complication or
bytes across the pipe.
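
The blocking behavior on the filter side could be as simple as waiting on a
channel; a hedged Go sketch (invented names, not LFS code): workers send
finished delay-ids on a channel, and the list_available_blobs handler blocks
on the first receive, then drains whatever else is already ready.

package main

import (
    "fmt"
    "time"
)

// listAvailableBlobs blocks until at least one delayed blob is
// ready, then returns every id that is ready right now. A closed
// channel (no more delayed blobs) yields an empty list.
func listAvailableBlobs(ready <-chan int) []int {
    ids := []int{}
    id, ok := <-ready // block for the first ready blob
    if !ok {
        return ids // empty list: the filter is done
    }
    ids = append(ids, id)
    for { // drain everything else that is already ready
        select {
        case id, ok := <-ready:
            if !ok {
                return ids
            }
            ids = append(ids, id)
        default:
            return ids
        }
    }
}

func main() {
    ready := make(chan int, 16)
    go func() { // simulate async smudging finishing later
        time.Sleep(50 * time.Millisecond)
        ready <- 7
        ready <- 13
        close(ready)
    }()
    // Each call blocks until something is ready; once the channel
    // is closed and drained, a call returns an empty list.
    for {
        ids := listAvailableBlobs(ready)
        fmt.Println(ids)
        if len(ids) == 0 {
            break
        }
    }
}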


--
Thanks,
Taylor Blau


Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-12 Thread Taylor Blau
> > (And at this point, may I suggest changing "delay-id" into "request-id=1"?)
>
> If there is no objection by another reviewer then I am happy to change it.

I think "delay-id" may be more illustrative of what's occurring in this request.
That being said, my preference would be that we remove the
"delay-id"/"request-id" entirely from the protocol, and make Git responsible for
handling the path lookup by a hashmap.

Is the concern that a hashmap covering all entries in a large checkout would be
too large to keep in memory? If so, keeping an opaque ID as a part of the
protocol is something I would not object to.

> >> +packet:  git> 0000
> >> +packet:  git> 0000  # empty content!
> >> +packet:  git< status=success
> >> +packet:  git< 0000
> >> +packet:  git< SMUDGED_CONTENT
> >> +packet:  git< 0000
> >> +packet:  git< 0000
> >
> > OK, good.
> >
> > The question is: what happens next?
> >
> > 2 things, kind of in parallel, but we need to prioritize and serialize:
> > - Send the next blob
> > - Fetch ready blobs
> > - And of course: ask for more ready blobs.
> > (it looks as if Peff and Jakub had useful comments already,
> >  so I can stop here?)
>
> I would like to keep the mechanism as follows:
>
> 1. Git sends all blobs to the filter
> 2. Git fetches blobs until we are done
>
> @Taylor: Do you think that would be OK for LFS?

I think that this would be fine for LFS and filters of this kind in general.
For LFS in particular, my initial inclination would be to have the protocol
support writing blob data back to Git at any time during the checkout process,
not just after all blobs have been sent to the filter.

That being said, I don't think this holds up in practice. The blobs are too big
to fit in memory anyway, and will just end up getting written to LFS's object
cache in .git/lfs/objects.

Since they're already in there, all we would have to do is keep the list of
`readyIds map[int]*os.File` in memory (or even map int -> LFS OID, and open the
file later), and then `io.Copy()` from the open file back to Git.
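
A hedged Go sketch of that bookkeeping (paths and names invented, not actual
LFS code):

package main

import (
    "fmt"
    "io"
    "os"
)

// readyIds maps a delay-id to the spooled/cached file that holds
// the smudged content, e.g. an object under .git/lfs/objects.
var readyIds = map[int]*os.File{}

// sendBlob streams the smudged content for one delay-id back to Git
// (an io.Writer stands in for the pkt-line stream here).
func sendBlob(id int, w io.Writer) error {
    f, ok := readyIds[id]
    if !ok {
        return fmt.Errorf("no ready blob for delay-id %d", id)
    }
    defer f.Close()
    delete(readyIds, id)
    _, err := io.Copy(w, f) // stream from disk, nothing held in memory
    return err
}

func main() {
    // Pretend the download for delay-id 0 just finished and was
    // written to a temp file (standing in for the LFS object cache).
    f, err := os.CreateTemp("", "lfs-blob-")
    if err != nil {
        panic(err)
    }
    defer os.Remove(f.Name())
    f.WriteString("SMUDGED_CONTENT")
    f.Seek(0, io.SeekStart)

    readyIds[0] = f
    if err := sendBlob(0, os.Stdout); err != nil {
        panic(err)
    }
}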

This makes me think of adding another capability to the protocol, which would
just be exchanging paths on disk in `/tmp` or any other directory so that we
wouldn't have to stream content over the pipe. Instead of responding with

packet:  git< status=success
packet:  git< 0000
packet:  git< SMUDGED_CONTENT
packet:  git< 0000
packet:  git< 0000

We could respond with:

packet:  git< status=success
packet:  git< 0000
packet:  git< /path/to/contents.dat # <-
packet:  git< 0000
packet:  git< 0000

Git would then be responsible for opening that file on disk (the filter would
guarantee that to be possible), and then copying its contents into the working
tree.

I think that's a topic for later discussion, though :-).

> > In general, Git should not have an unlimited number of blobs outstanding,
> > as memory constraints may apply.
> > There may be a config variable for the number of outstanding blobs
> > (similar to the window size in other protocols) and a variable
> > for the number of "sent bytes in outstanding blobs"
> > (similar to the window size (again!) in e.g. TCP).
> >
> > The number of outstanding blobs may be less important; it is more
> > important to monitor the number of bytes we keep in memory in some way.
> >
> > Something like "we set a limit to 500K of outstanding data": once we are
> > above the limit, don't send any new blobs.
>
> I don't expect the filter to keep everything in memory. If there is no
> memory left then I expect the filter to spool to disk. This keeps the
> protocol simple. If this turns out to be not sufficient then we could
> improve that later, too.

Agree.


--
Thanks,
Taylor Blau


Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-11 Thread Torsten Bögershausen
On 2017-04-11 21:50, Lars Schneider wrote:

[]
>> packet:  git> command=smudge
>> packet:  git> pathname=path/testfile.dat
>> packet:  git> delay-id=1
>> packet:  git> 0000
>> packet:  git> CONTENT
>> packet:  git> 0000
>> packet:  git< status=delayed # this means: Git, please feed more
>> packet:  git> 0000
> Actually, this is how I implemented it first.
> 
> However, I didn't like that because we associate a
> pathname with a delay-id. If the filter does not
> delay the item then we associate a different
> pathname with the same delay-id in the next request.
> Therefore I think it is better to present the delay-id
> *only* to the filter if the item is actually delayed.
> 
> I would be surprised if the extra round trip impacted
> the performance in any meaningful way.
> 
> 

Two spontaneous remarks:

- Git can simply use a counter which is incremented for each blob
  that is sent to the filter.
  Regardless of what the filter answers (delayed or not), simply increment a
  counter. (or is this too simple and am I missing something?)

- I was thinking that the filter code is written as either "never delay" or
  "always delay".
  "Never delay" is the existing code.
  What is your idea: when should a filter respond with delayed?
  My thinking was "always", silently assuming that more than one core can be
  used, so that blobs can be filtered in parallel.
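
As an illustration only (invented names, not an existing filter), that
"always delay" model could look like this in Go: every blob is handed to a
worker pool sized to the available cores, and the filter answers "delayed"
right away.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    blobs := []string{"a.dat", "b.dat", "c.dat", "d.dat"}
    jobs := make(chan string)
    var wg sync.WaitGroup

    for i := 0; i < runtime.NumCPU(); i++ { // one worker per core
        wg.Add(1)
        go func() {
            defer wg.Done()
            for path := range jobs {
                fmt.Println("filtered", path) // stand-in for real work
            }
        }()
    }
    for _, b := range blobs {
        jobs <- b // hand the blob to a worker...
        fmt.Println("status=delayed for", b) // ...and answer Git at once
    }
    close(jobs)
    wg.Wait() // all delayed blobs eventually become available
}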

>We could do this but I think this would only complicate
>the protocol. I expect the filter to spool results to
>disk or something.
  Spooling things to disk was not part of my picture, to be honest.
  This means additional execution time when an SSD is used, and the chips
  wear out faster...
  There may be situations where this is OK for some users (but not for others).
  How can we prevent Git from (over-)flooding the filter?
  The protocol response from the filter would be just "delayed", and the filter
  would block Git, right?
  But, in any case, it would still be nice if Git would collect converted blobs
  from the filter, to free resources here.
  This is more like the "streaming model", but on a higher level:
  send 4 blobs to the filter, collect the ready ones, send the 5th blob to
  the filter, collect the ready ones, send the 6th blob to the filter, collect
  the ready ones...


(Back to the roots)
Which criteria do you have in mind: When should a filter process the blob
and return it immediately, and when would it respond "delayed" ?






Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-11 Thread Lars Schneider

> On 10 Apr 2017, at 22:54, Torsten Bögershausen  wrote:
> 
> On 2017-04-09 21:11, Lars Schneider wrote:
> []
>> +
>> +packet:  git> command=smudge
>> +packet:  git> pathname=path/testfile.dat
>> +packet:  git> delay-able=1
>> +packet:  git> 0000
>> +packet:  git> CONTENT
>> +packet:  git> 0000
>> +packet:  git< status=delayed
>> +packet:  git< 0000
>> +packet:  git> delay-id=1
>> +packet:  git> 0000
>> +packet:  git< status=success
>> +packet:  git< 0000
> 
> (not sure if this was mentioned before)
> If a filter uses the delayed feature, I would read it as
> a response from the filter in the style:
> "Hallo Git, I need some time to process it, but as I have
> CPU capacity available, please send another blob,
> so I can chew them in parallel."
> 
> Can we save one round trip ?
> 
> packet:  git> command=smudge
> packet:  git> pathname=path/testfile.dat
> packet:  git> delay-id=1
> packet:  git> 0000
> packet:  git> CONTENT
> packet:  git> 0000
> packet:  git< status=delayed # this means: Git, please feed more
> packet:  git> 0000

Actually, this is how I implemented it first.

However, I didn't like that because we associate a
pathname with a delay-id. If the filter does not
delay the item then we associate a different
pathname with the same delay-id in the next request.
Therefore I think it is better to present the delay-id
*only* to the filter if the item is actually delayed.

I would be surprised if the extra round trip impacted
the performance in any meaningful way.


> # Git feeds the next blob.
> # This may be repeated for some rounds.
> # (We may want to restrict the number of rounds for Git, see below.)
> # After some rounds, the filter needs to signal:
> # no more fresh blobs please, collect some data so I can free memory,
> # and after that I am able to take a fresh blob.
> packet:  git> command=smudge
> packet:  git> pathname=path/testfile.dat
> packet:  git> delay-id=2
> packet:  git> 0000
> packet:  git> CONTENT
> packet:  git> 0000
> packet:  git< status=pleaseWait
> packet:  git> 0000
> 
> # Now Git needs to ask for ready blobs.

We could do this but I think this would only complicate
the protocol. I expect the filter to spool results to
disk or something.


>> +
>> +
>> +If the filter supports the "delay" capability then it must support the
>> +"list_available_blobs" command. If Git sends this command, then the
>> +filter is expected to return a list of "delay_ids" of blobs that are
>> +available. The list must be terminated with a flush packet followed
>> +by a "success" status that is also terminated with a flush packet. If
>> +no blobs for the delayed paths are available yet, then the filter is
>> +expected to block the response until at least one blob becomes
>> +available. The filter can tell Git that it has no more delayed blobs
>> +by sending an empty list.
>> +
>> +packet:  git> command=list_available_blobs
>> +packet:  git> 0000
>> +packet:  git< 7
> 
> Is the "7" the same as the "delay-id=1" from above?

Yes! Sorry, I will make this clearer in the next round.


> It may be easier to understand, even if it costs some bytes, to answer instead
> packet:  git< delay-id=1

Agreed!


> (And at this point, may I suggest changing "delay-id" into "request-id=1"?)

If there is no objection by another reviewer then I am happy to change it.


>> +packet:  git< 13
> 
> Same question here: is this the delay-id ?

Yes.


>> +packet:  git< 0000
>> +packet:  git< status=success
>> +packet:  git< 0000
>> +
>> +
>> +After Git received the "delay_ids", it will request the corresponding
>> +blobs again. These requests contain a "delay-id" and an empty content
>> +section. The filter is expected to respond with the smudged content
>> +in the usual way as explained above.
>> +
>> +packet:  git> command=smudge
>> +packet:  git> pathname=test-delay10.a
>> +packet:  git> delay-id=0
> 
> Minor question: Where does the "id=0" come from?

That's the delay-id (aka request-id) that Git gave to the filter
on the first request (which was delayed).


>> +packet:  git> 0000
>> +packet:  git> 0000  # empty content!
>> +packet:  git< status=success
>> +packet:  git< 0000
>> +packet:  git< SMUDGED_CONTENT
>> +packet:  git< 0000
>> +packet:  git< 0000
> 
> OK, good.
> 
> The question is: what happens next?
> 
> 2 things, kind of in parallel, but we need to prioritize and serialize:
> - Send the next blob
> - Fetch ready blobs
> - And of course: ask for more ready blobs.
> (it looks as if Peff and Jakub had useful comments already,
>  so I can stop here?)

Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-10 Thread Torsten Bögershausen
On 2017-04-09 21:11, Lars Schneider wrote:
[]
> +
> +packet:  git> command=smudge
> +packet:  git> pathname=path/testfile.dat
> +packet:  git> delay-able=1
> +packet:  git> 0000
> +packet:  git> CONTENT
> +packet:  git> 0000
> +packet:  git< status=delayed
> +packet:  git< 0000
> +packet:  git> delay-id=1
> +packet:  git> 0000
> +packet:  git< status=success
> +packet:  git< 0000

(not sure if this was mentioned before)
If a filter uses the delayed feature, I would read it as
a response from the filter in the style:
"Hallo Git, I need some time to process it, but as I have
CPU capacity available, please send another blob,
so I can chew them in parallel."

Can we save one round trip ?

packet:  git> command=smudge
packet:  git> pathname=path/testfile.dat
packet:  git> delay-id=1
packet:  git> 0000
packet:  git> CONTENT
packet:  git> 0000
packet:  git< status=delayed # this means: Git, please feed more
packet:  git> 0000

# Git feeds the next blob.
# This may be repeated for some rounds.
# (We may want to restrict the number of rounds for Git, see below.)
# After some rounds, the filter needs to signal:
# no more fresh blobs please, collect some data so I can free memory,
# and after that I am able to take a fresh blob.
packet:  git> command=smudge
packet:  git> pathname=path/testfile.dat
packet:  git> delay-id=2
packet:  git> 0000
packet:  git> CONTENT
packet:  git> 0000
packet:  git< status=pleaseWait
packet:  git> 0000

# Now Git needs to ask for ready blobs.

> +
> +
> +If the filter supports the "delay" capability then it must support the
> +"list_available_blobs" command. If Git sends this command, then the
> +filter is expected to return a list of "delay_ids" of blobs that are
> +available. The list must be terminated with a flush packet followed
> +by a "success" status that is also terminated with a flush packet. If
> +no blobs for the delayed paths are available yet, then the filter is
> +expected to block the response until at least one blob becomes
> +available. The filter can tell Git that it has no more delayed blobs
> +by sending an empty list.
> +
> +packet:  git> command=list_available_blobs
> +packet:  git> 0000
> +packet:  git< 7

Is the "7" the same as the "delay-id=1" from above?
It may be easier to understand, even if it costs some bytes, to answer instead
packet:  git< delay-id=1
(And at this point, may I suggest changing "delay-id" into "request-id=1"?)

> +packet:  git< 13

Same question here: is this the delay-id ?

> +packet:  git< 0000
> +packet:  git< status=success
> +packet:  git< 0000
> +
> +
> +After Git received the "delay_ids", it will request the corresponding
> +blobs again. These requests contain a "delay-id" and an empty content
> +section. The filter is expected to respond with the smudged content
> +in the usual way as explained above.
> +
> +packet:  git> command=smudge
> +packet:  git> pathname=test-delay10.a
> +packet:  git> delay-id=0

Minor question: Where does the "id=0" come from?

> +packet:  git> 0000
> +packet:  git> 0000  # empty content!
> +packet:  git< status=success
> +packet:  git< 0000
> +packet:  git< SMUDGED_CONTENT
> +packet:  git< 0000
> +packet:  git< 0000

OK, good.

The question is: what happens next?

2 things, kind of in parallel, but we need to prioritize and serialize:
- Send the next blob
- Fetch ready blobs
- And of course: ask for more ready blobs.
(it looks as if Peff and Jakub had useful comments already,
  so I can stop here?)


In general, Git should not have an unlimited number of blobs outstanding,
as memory constraints may apply.
There may be a config variable for the number of outstanding blobs
(similar to the window size in other protocols) and a variable
for the number of "sent bytes in outstanding blobs"
(similar to the window size (again!) in e.g. TCP).

The number of outstanding blobs may be less important; it is more
important to monitor the number of bytes we keep in memory in some way.

Something like "we set a limit to 500K of outstanding data": once we are
above the limit, don't send any new blobs.
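
A minimal sketch of such a byte-based window, assuming a made-up 500K limit
and invented names (no such config variable exists in Git today):

package main

import "fmt"

// window tracks the bytes in outstanding (sent but not yet
// collected) blobs, in the spirit of a TCP-style window.
type window struct {
    limit, outstanding int
}

func (w *window) canSend(size int) bool {
    return w.outstanding+size <= w.limit
}

func (w *window) sent(size int)      { w.outstanding += size }
func (w *window) collected(size int) { w.outstanding -= size }

func main() {
    w := window{limit: 500 * 1024} // "a limit of 500K of outstanding data"
    blobs := []int{200 * 1024, 200 * 1024, 200 * 1024}

    for i, size := range blobs {
        if !w.canSend(size) {
            // above the limit: collect a ready blob before sending more
            w.collected(200 * 1024)
            fmt.Println("collected one ready blob to get under the limit")
        }
        w.sent(size)
        fmt.Printf("sent blob %d (%d bytes outstanding)\n", i, w.outstanding)
    }
}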



(No, I didn't look at the code at all, this protocol discussion
is much more interesting)




Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-10 Thread Lars Schneider

> On 10 Apr 2017, at 16:28, Eric Wong  wrote:
> 
> Lars Schneider  wrote:
>>> diff --git a/convert.h b/convert.h
>>> index 82871a11d5..da6c702090 100644
>>> --- a/convert.h
>>> +++ b/convert.h
>>> @@ -42,6 +42,11 @@ extern int convert_to_git(const char *path, const char *src, size_t len,
>>>   struct strbuf *dst, enum safe_crlf checksafe);
>>> extern int convert_to_working_tree(const char *path, const char *src,
>>>size_t len, struct strbuf *dst);
>>> +extern int async_convert_to_working_tree(const char *path, const char *src,
>>> +size_t len, struct strbuf *dst,
>>> +void *dco);
>>> 
>> 
>> I don't like the void pointer here. However, "cache.h" includes "convert.h"
>> and therefore "convert.h" cannot include "cache.h". That's why "convert.h"
>> doesn't know about "struct delayed_checkout".
> 
> You can forward declare the struct without fields in convert.h:

OMG. Of course. Now I feel stupid.

>> I just realized that I could move the "struct delayed_checkout" and
>> "enum ce_delay_state" definitions from "cache.h" to "convert.h" to solve
>> the problem nicely.
>> 
> 
> But yeah, maybe you can reduce cache.h size, too :)

Yeah, then I will do this in the next round.

Thanks,
Lars

Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-10 Thread Eric Wong
Lars Schneider  wrote:
> > diff --git a/convert.h b/convert.h
> > index 82871a11d5..da6c702090 100644
> > --- a/convert.h
> > +++ b/convert.h
> > @@ -42,6 +42,11 @@ extern int convert_to_git(const char *path, const char *src, size_t len,
> >   struct strbuf *dst, enum safe_crlf checksafe);
> > extern int convert_to_working_tree(const char *path, const char *src,
> >size_t len, struct strbuf *dst);
> > +extern int async_convert_to_working_tree(const char *path, const char *src,
> > +size_t len, struct strbuf *dst,
> > +void *dco);
> > 
> 
> I don't like the void pointer here. However, "cache.h" includes "convert.h"
> and therefore "convert.h" cannot include "cache.h". That's why "convert.h"
> doesn't know about "struct delayed_checkout".

You can forward declare the struct without fields in convert.h:

diff --git a/convert.h b/convert.h
index da6c702090..3fb6b420b2 100644
--- a/convert.h
+++ b/convert.h
@@ -32,6 +32,8 @@ enum eol {
 #endif
 };
 
+struct delayed_checkout;
+
 extern enum eol core_eol;
 extern const char *get_cached_convert_stats_ascii(const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
@@ -44,7 +46,7 @@ extern int convert_to_working_tree(const char *path, const char *src,
   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
 size_t len, struct strbuf *dst,
-void *dco);
+struct delayed_checkout *dco);
 extern int async_query_available_blobs(const char *cmd, unsigned long **delay_ids,
   int *delay_ids_nr);
 extern int renormalize_buffer(const char *path, const char *src, size_t len,

> 
> I just realized that I could move the "struct delayed_checkout" and
> "enum ce_delay_state" definitions from "cache.h" to "convert.h" to solve
> the problem nicely.
> 

But yeah, maybe you can reduce cache.h size, too :)

> Any objection to this approach?
> 


Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

2017-04-10 Thread Lars Schneider

> On 09 Apr 2017, at 21:11, Lars Schneider  wrote:
> 
> Some `clean` / `smudge` filters might require a significant amount of
> time to process a single blob. During this process the Git checkout
> operation is blocked and Git needs to wait until the filter is done to
> continue with the checkout.
> 
> Teach the filter process protocol (introduced in edcc858) to accept the
> status "delayed" as response to a filter request. Upon this response Git
> continues with the checkout operation. After the checkout operation Git
> calls "finish_delayed_checkout" which queries the filter for remaining
> blobs. If the filter is still working on the completion, then the filter
> is expected to block. If the filter has completed all remaining blobs
> then an empty response is expected.
> 
> Git has multiple code paths that check out a blob. Support delayed
> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.
> 
> Signed-off-by: Lars Schneider 
> ---
> ...
> diff --git a/convert.h b/convert.h
> index 82871a11d5..da6c702090 100644
> --- a/convert.h
> +++ b/convert.h
> @@ -42,6 +42,11 @@ extern int convert_to_git(const char *path, const char *src, size_t len,
> struct strbuf *dst, enum safe_crlf checksafe);
> extern int convert_to_working_tree(const char *path, const char *src,
>  size_t len, struct strbuf *dst);
> +extern int async_convert_to_working_tree(const char *path, const char *src,
> +  size_t len, struct strbuf *dst,
> +  void *dco);
> 

I don't like the void pointer here. However, "cache.h" includes "convert.h" and
therefore "convert.h" cannot include "cache.h". That's why "convert.h" doesn't
know about "struct delayed_checkout". 

I just realized that I could move the "struct delayed_checkout" and
"enum ce_delay_state" definitions from "cache.h" to "convert.h" to solve
the problem nicely.

Any objection to this approach?

Thanks,
Lars