Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Duy Nguyen
On Tue, Jun 28, 2016 at 3:43 PM, Lars Schneider
 wrote:
>
>> On 28 Jun 2016, at 15:14, Johannes Schindelin  
>> wrote:
>>
>> Hi Duy,
>>
>> On Tue, 28 Jun 2016, Duy Nguyen wrote:
>>
>>> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
>>>  wrote:

>>>> On Mon, 27 Jun 2016, Duy Nguyen wrote:
>>>>
>>>>> On Mon, Jun 27, 2016 at 7:38 AM,   wrote:
>>>>>> ## Proposed solution
>>>>>> Git LFS caches its objects under .git/lfs/objects. Most of the time
>>>>>> Git LFS objects are already available in the cache (e.g. if you
>>>>>> switch branches back and forth). I implemented these "cache hits"
>>>>>> natively in Git.  Please note that this implementation is just a
>>>>>> quick and dirty proof of concept. If the Git community agrees that
>>>>>> this kind of approach would be acceptable then I will start to work
>>>>>> on a proper patch series with cross platform support and unit
>>>>>> tests.
>>>>>
>>>>> Would it be possible to move all this code to a separate daemon?
>>>>> Instead of spawning a new process to do the filtering, you send a
>>>>> command "convert this" over maybe unix socket and either receive the
>>>>> whole result over the socket, or receive a path of the result.
>>>>
>>>> Unix sockets are not really portable...
>>>
>>> It's the same situation as index-helper. I expect you guys will
>>> replace the transport with named pipe or similar.
>>
>> Yes, I will have to work on that. But I might need to ask for a change in
>> the design if I hit some obstacle there: named pipes are not the same at
>> all as Unix sockets.
>>
>> Read: it will be painful, and not a general solution. So every new Unix
>> socket that you introduce will introduce new problems for me.
>
> Thanks Duy for your suggestion. I considered a daemon, but a daemon always
> makes things harder for the user, who has to ensure the daemon is
> running! Plus, there are Dscho's concerns regarding Windows.
>
> I think the core problem is that we invoke the filter for every file:
> https://github.com/git/git/blob/master/convert.c#L461-L475
>
> Couldn't we start the filter executable at the beginning of the Git process
> and communicate with it via stdin/stdout whenever we hit the Git filter
> code? Would that work?

Yes, if one filter process per Git process still brings a significant
performance gain for you, why not? It's simpler than a daemon.
Though you may want to look at Christian's external odb first, in the
other mail. Note, though, that external odb may still spawn processes a
lot (the design assumes you cache objects locally once, after which you
don't have to spawn again). Whether that fits the LFS scheme, I have no
idea (I have never used git-lfs myself).

> Alternatively, do you see a way to add a "plugin" system to Git? Where Git
> could be configured to dynamically load a "filter" library?

I don't think plugins as .so files are welcome, because they would force
us to freeze some ABI. So far, all Git extensions have been implemented
via an external process.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Duy Nguyen
On Tue, Jun 28, 2016 at 3:14 PM, Johannes Schindelin
 wrote:
> Hi Duy,
>
> On Tue, 28 Jun 2016, Duy Nguyen wrote:
>
>> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
>>  wrote:
>> >
>> > On Mon, 27 Jun 2016, Duy Nguyen wrote:
>> >
>> >> On Mon, Jun 27, 2016 at 7:38 AM,   wrote:
>> >> > ## Proposed solution
>> >> > Git LFS caches its objects under .git/lfs/objects. Most of the time
>> >> > Git LFS objects are already available in the cache (e.g. if you
>> >> > switch branches back and forth). I implemented these "cache hits"
>> >> > natively in Git.  Please note that this implementation is just a
>> >> > quick and dirty proof of concept. If the Git community agrees that
>> >> > this kind of approach would be acceptable then I will start to work
>> >> > on a proper patch series with cross platform support and unit
>> >> > tests.
>> >>
>> >> Would it be possible to move all this code to a separate daemon?
>> >> Instead of spawning a new process to do the filtering, you send a
>> >> command "convert this" over maybe unix socket and either receive the
>> >> whole result over the socket, or receive a path of the result.
>> >
>> > Unix sockets are not really portable...
>>
>> It's the same situation as index-helper. I expect you guys will
>> replace the transport with named pipe or similar.
>
> Yes, I will have to work on that. But I might need to ask for a change in
> the design if I hit some obstacle there: named pipes are not the same at
> all as Unix sockets.
>
> Read: it will be painful, and not a general solution. So every new Unix
> socket that you introduce will introduce new problems for me.

I thought we could have a drop-in replacement (or maybe a higher
abstraction that would be sufficient for git). Thanks for pointing it
out.
-- 
Duy


Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Christian Couder
On Tue, Jun 28, 2016 at 3:22 PM, Lars Schneider
 wrote:
>
> @Christian/Peff:
> Is there a place to look for more info about your remote-object-store idea?

You may want to take a look at:

https://github.com/chriscool/git/commits/external-odb

I just updated it and I may send an updated RFC series from this
branch to the list soon.


Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Lars Schneider

> On 28 Jun 2016, at 15:14, Johannes Schindelin  
> wrote:
> 
> Hi Duy,
> 
> On Tue, 28 Jun 2016, Duy Nguyen wrote:
> 
>> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
>>  wrote:
>>> 
>>> On Mon, 27 Jun 2016, Duy Nguyen wrote:
>>> 
>>>> On Mon, Jun 27, 2016 at 7:38 AM,   wrote:
>>>>> ## Proposed solution
>>>>> Git LFS caches its objects under .git/lfs/objects. Most of the time
>>>>> Git LFS objects are already available in the cache (e.g. if you
>>>>> switch branches back and forth). I implemented these "cache hits"
>>>>> natively in Git.  Please note that this implementation is just a
>>>>> quick and dirty proof of concept. If the Git community agrees that
>>>>> this kind of approach would be acceptable then I will start to work
>>>>> on a proper patch series with cross platform support and unit
>>>>> tests.
>>>>
>>>> Would it be possible to move all this code to a separate daemon?
>>>> Instead of spawning a new process to do the filtering, you send a
>>>> command "convert this" over maybe unix socket and either receive the
>>>> whole result over the socket, or receive a path of the result.
>>> 
>>> Unix sockets are not really portable...
>> 
>> It's the same situation as index-helper. I expect you guys will
>> replace the transport with named pipe or similar.
> 
> Yes, I will have to work on that. But I might need to ask for a change in
> the design if I hit some obstacle there: named pipes are not the same at
> all as Unix sockets.
> 
> Read: it will be painful, and not a general solution. So every new Unix
> socket that you introduce will introduce new problems for me.

Thanks Duy for your suggestion. I considered a daemon, but a daemon always
makes things harder for the user, who has to ensure the daemon is
running! Plus, there are Dscho's concerns regarding Windows.

I think the core problem is that we invoke the filter for every file:
https://github.com/git/git/blob/master/convert.c#L461-L475

Couldn't we start the filter executable at the beginning of the Git process
and communicate with it via stdin/stdout whenever we hit the Git filter 
code? Would that work?
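
As a rough illustration of that idea (in Python rather than Git's C, and not
how Git or Git LFS implements anything), the sketch below starts one filter
process and reuses it for every file. The 4-byte length-prefixed framing, the
`FILTER_SRC` child script and the `LongRunningFilter` class are all
hypothetical; the child simply upper-cases its input to stand in for a real
smudge filter.

```python
import struct
import subprocess
import sys

# Hypothetical filter: reads length-prefixed requests on stdin until EOF and
# answers each with a length-prefixed response (here: the input upper-cased).
FILTER_SRC = r"""
import struct, sys
inp, out = sys.stdin.buffer, sys.stdout.buffer
while True:
    header = inp.read(4)
    if len(header) < 4:
        break
    (size,) = struct.unpack(">I", header)
    result = inp.read(size).upper()
    out.write(struct.pack(">I", len(result)) + result)
    out.flush()
"""


class LongRunningFilter:
    """Starts the filter once and reuses it for every file."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", FILTER_SRC],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def convert(self, data: bytes) -> bytes:
        # One request: 4-byte big-endian length, then the payload.
        self.proc.stdin.write(struct.pack(">I", len(data)) + data)
        self.proc.stdin.flush()
        (size,) = struct.unpack(">I", self._read_exact(4))
        return self._read_exact(size)

    def _read_exact(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = self.proc.stdout.read(n - len(buf))
            if not chunk:
                raise EOFError("filter exited unexpectedly")
            buf += chunk
        return buf

    def close(self) -> int:
        # Closing stdin signals EOF, which ends the child's loop.
        self.proc.stdin.close()
        self.proc.stdout.close()
        return self.proc.wait()


if __name__ == "__main__":
    flt = LongRunningFilter()
    for content in (b"file one", b"file two"):
        print(flt.convert(content))
    flt.close()
```

The process startup cost is paid once per Git process instead of once per
file, and only pipes are involved, so no daemon has to be kept running.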

Alternatively, do you see a way to add a "plugin" system to Git? Where Git
could be configured to dynamically load a "filter" library?

@Dscho:
Do you have a recommendation for interprocess communication that works 
without trouble on Windows? 

Thanks,
Lars


Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Lars Schneider

> On 27 Jun 2016, at 18:09, Junio C Hamano  wrote:
> 
> larsxschnei...@gmail.com writes:
> 
>> Unfortunately that fix helps only with cloning. Any local Git operation
>> that invokes the clean/smudge filter (e.g. switching branches) is still
>> slow.
> 
> Do you know where the slowness comes from?  Does Joey's new
> clean/smudge interface help GitLFS?

I am pretty sure the startup time of the external clean/smudge process
causes the slowness, and consequently I don't think Joey's patch would help.
The following test makes me believe that:

I ran the same test as in my original email, using the repo with 15,000
LFS files. Instead of the LFS binary I used the fast and simple shell
built-in `true` command:

$ git -c filter.lfs.smudge=true -c filter.lfs.clean=true clone https://github.com/larsxschneider/lfstest-manyfiles.git
$ cd lfstest-manyfiles/
$ time git -c filter.lfs.smudge=true -c filter.lfs.clean=true checkout removed-files

real    0m47.030s
user    0m29.521s
sys     0m16.993s

It still takes 47 seconds to switch the branch. Does this test prove my
point or do you see a flaw in the test?
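
One way to look at the spawn cost in isolation is sketched below in Python.
The `time_spawns` helper is hypothetical and measures only process startup
for a no-op command, not anything Git does around it. As a point of
comparison, and assuming each of the 15,000 files triggered exactly one
filter invocation, 47 seconds would correspond to roughly 3 ms per
invocation.

```python
import shutil
import subprocess
import sys
import time


def time_spawns(count: int) -> float:
    """Return the seconds spent starting `count` no-op processes in sequence."""
    # Prefer the external `true` binary; fall back to a no-op Python process
    # (much heavier, so it overstates the cost) where `true` is unavailable.
    true_path = shutil.which("true")
    noop = [true_path] if true_path else [sys.executable, "-c", "pass"]
    start = time.perf_counter()
    for _ in range(count):
        subprocess.run(noop, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 200
    elapsed = time_spawns(n)
    print(f"{n} spawns: {elapsed:.3f}s ({elapsed / n * 1000:.2f} ms each)")
```

The numbers depend heavily on the platform, so this is only a sanity check
of where the time might go, not a substitute for profiling the checkout.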


> You are not likely to get anything that knows that a blob object may
> be named as anything other than SHA-1("blob " + <length> + NUL +
> <contents>) into Git core.  The remote-object-store idea that was floated
> by Peff, and that Christian started running with, at least maintains that
> object naming property and has a better chance of interacting well with
> the core, but LFS, Annex or anything that would not preserve the object
> naming would not.
> 
> Personally, I view a surrogate blob left by LFS in the tree object
> and filtered via clean/smudge as a "smarter" kind of symbolic link that
> points outside what Git controls.  The area outside what Git
> controls is left to be managed by whatever the add-on does; Git
> shouldn't even be aware of how they are structured and/or managed.

I understand and somewhat anticipated your point of view. I will try
to find a less intrusive solution.

@Christian/Peff: 
Is there a place to look for more info about your remote-object-store idea? 

Thanks,
Lars


Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Johannes Schindelin
Hi Duy,

On Tue, 28 Jun 2016, Duy Nguyen wrote:

> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
>  wrote:
> >
> > On Mon, 27 Jun 2016, Duy Nguyen wrote:
> >
> >> On Mon, Jun 27, 2016 at 7:38 AM,   wrote:
> >> > ## Proposed solution
> >> > Git LFS caches its objects under .git/lfs/objects. Most of the time
> >> > Git LFS objects are already available in the cache (e.g. if you
> >> > switch branches back and forth). I implemented these "cache hits"
> >> > natively in Git.  Please note that this implementation is just a
> >> > quick and dirty proof of concept. If the Git community agrees that
> >> > this kind of approach would be acceptable then I will start to work
> >> > on a proper patch series with cross platform support and unit
> >> > tests.
> >>
> >> Would it be possible to move all this code to a separate daemon?
> >> Instead of spawning a new process to do the filtering, you send a
> >> command "convert this" over maybe unix socket and either receive the
> >> whole result over the socket, or receive a path of the result.
> >
> > Unix sockets are not really portable...
> 
> It's the same situation as index-helper. I expect you guys will
> replace the transport with named pipe or similar.

Yes, I will have to work on that. But I might need to ask for a change in
the design if I hit some obstacle there: named pipes are not the same at
all as Unix sockets.

Read: it will be painful, and not a general solution. So every new Unix
socket that you introduce will introduce new problems for me.

Ciao,
Dscho


Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Duy Nguyen
On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
 wrote:
> Hi Duy,
>
> On Mon, 27 Jun 2016, Duy Nguyen wrote:
>
>> On Mon, Jun 27, 2016 at 7:38 AM,   wrote:
>> > ## Proposed solution
>> > Git LFS caches its objects under .git/lfs/objects. Most of the time
>> > Git LFS objects are already available in the cache (e.g. if you switch
>> > branches back and forth). I implemented these "cache hits" natively in
>> > Git.  Please note that this implementation is just a quick and dirty
>> > proof of concept. If the Git community agrees that this kind of
>> > approach would be acceptable then I will start to work on a proper
>> > patch series with cross platform support and unit tests.
>>
>> Would it be possible to move all this code to a separate daemon?
>> Instead of spawning a new process to do the filtering, you send a
>> command "convert this" over maybe unix socket and either receive the
>> whole result over the socket, or receive a path of the result.
>
> Unix sockets are not really portable...

It's the same situation as index-helper. I expect you guys will
replace the transport with named pipe or similar.
-- 
Duy


Re: [RFC] Native access to Git LFS cache

2016-06-28 Thread Johannes Schindelin
Hi Duy,

On Mon, 27 Jun 2016, Duy Nguyen wrote:

> On Mon, Jun 27, 2016 at 7:38 AM,   wrote:
> > ## Proposed solution
> > Git LFS caches its objects under .git/lfs/objects. Most of the time
> > Git LFS objects are already available in the cache (e.g. if you switch
> > branches back and forth). I implemented these "cache hits" natively in
> > Git.  Please note that this implementation is just a quick and dirty
> > proof of concept. If the Git community agrees that this kind of
> > approach would be acceptable then I will start to work on a proper
> > patch series with cross platform support and unit tests.
> 
> Would it be possible to move all this code to a separate daemon?
> Instead of spawning a new process to do the filtering, you send a
> command "convert this" over maybe unix socket and either receive the
> whole result over the socket, or receive a path of the result.

Unix sockets are not really portable...

Ciao,
Dscho


Re: [RFC] Native access to Git LFS cache

2016-06-27 Thread Junio C Hamano
larsxschnei...@gmail.com writes:

> Unfortunately that fix helps only with cloning. Any local Git operation
> that invokes the clean/smudge filter (e.g. switching branches) is still
> slow.

Do you know where the slowness comes from?  Does Joey's new
clean/smudge interface help GitLFS?

You are not likely to get anything that knows that a blob object may
be named as anything other than SHA-1("blob " + <length> + NUL +
<contents>) into Git core.  The remote-object-store idea that was floated
by Peff, and that Christian started running with, at least maintains that
object naming property and has a better chance of interacting well with
the core, but LFS, Annex or anything that would not preserve the object
naming would not.
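
For reference, the object naming property amounts to hashing a short header
("blob", the content length in decimal, and a NUL byte) followed by the
contents. A minimal Python sketch for SHA-1 repositories, with
`git_blob_id` being a hypothetical helper name:

```python
import hashlib


def git_blob_id(contents: bytes) -> str:
    """Object name Git assigns to a blob with these contents (SHA-1 repos)."""
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


if __name__ == "__main__":
    # The empty blob has a well-known ID.
    print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

An LFS pointer file is itself a blob named this way; the large file it
refers to is what lives outside this naming scheme.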

Personally, I view a surrogate blob left by LFS in the tree object
and filtered via clean/smudge as a "smarter" kind of symbolic link that
points outside what Git controls.  The area outside what Git
controls is left to be managed by whatever the add-on does; Git
shouldn't even be aware of how they are structured and/or managed.




Re: [RFC] Native access to Git LFS cache

2016-06-27 Thread Duy Nguyen
On Mon, Jun 27, 2016 at 7:38 AM,   wrote:
> ## Proposed solution
> Git LFS caches its objects under .git/lfs/objects. Most of the time Git
> LFS objects are already available in the cache (e.g. if you switch branches
> back and forth). I implemented these "cache hits" natively in Git.
> Please note that this implementation is just a quick and dirty proof of
> concept. If the Git community agrees that this kind of approach would be
> acceptable then I will start to work on a proper patch series with cross
> platform support and unit tests.

Would it be possible to move all this code to a separate daemon?
Instead of spawning a new process to do the filtering, you send a
command "convert this" over maybe unix socket and either receive the
whole result over the socket, or receive a path of the result.

I don't think hard-coding "git-lfs" is a good way to go (if you keep
that in the final implementation, of course). I guess the costly part is
spawning processes and going through the same process initialization
for every object. If we keep a daemon running, all that is gone. You
still have to pay for extra context switches and memory copies (unless
you send the path, but then it could be racy), but I think that's
negligible. And all smudge/clean filters can do caching and more if
they want to.
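
To make the shape of that concrete, here is a minimal Python sketch of a
filter daemon on a Unix socket that returns the whole result over the
socket. Everything in it is an assumption for illustration: the 4-byte
length-prefixed framing, the `start_daemon` and `convert_via_daemon` names,
and the use of `bytes.upper` as a stand-in for a real filter. `AF_UNIX` is
not available everywhere, so the transport would need replacing on some
platforms.

```python
import os
import socket
import tempfile
import threading


def recv_exact(conn, n):
    """Read exactly n bytes from the socket or raise EOFError."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the connection")
        buf += chunk
    return buf


def serve(listener, convert):
    """Accept clients one at a time; each request is a length plus payload."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return  # listener was closed
        with conn:
            try:
                while True:
                    size = int.from_bytes(recv_exact(conn, 4), "big")
                    result = convert(recv_exact(conn, size))
                    conn.sendall(len(result).to_bytes(4, "big") + result)
            except EOFError:
                pass  # client is done


def start_daemon(path, convert):
    """Bind the socket and serve requests on a background thread."""
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen()
    threading.Thread(target=serve, args=(listener, convert), daemon=True).start()
    return listener


def convert_via_daemon(conn, data):
    """Send one "convert this" request and return the converted bytes."""
    conn.sendall(len(data).to_bytes(4, "big") + data)
    size = int.from_bytes(recv_exact(conn, 4), "big")
    return recv_exact(conn, size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "filter.sock")
        start_daemon(path, bytes.upper)
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(path)
            print(convert_via_daemon(client, b"file one"))
            print(convert_via_daemon(client, b"file two"))
```

Sending the contents rather than a path costs an extra copy per object but
avoids the race mentioned above.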
-- 
Duy