Re: [PATCH v1 2/2] read-cache: make sure file handles are not inherited by child processes

2016-09-08 Thread Lars Schneider

> On 07 Sep 2016, at 20:23, Junio C Hamano  wrote:
> 
> Eric Wong  writes:
> 
>> We probably should be using O_NOATIME for all O_RDONLY cases
>> to get the last bit of performance out (especially since
>> non-modern-Linux systems probably still lack relatime).
> 
> No, please do not go there.
> 
> The user can read from a file in a working tree using "less",
> "grep", etc., and they all update the atime, so should "git grep".
> We do not use atime ourselves on these files but we should let
> outside tools rely on the validity of atime (e.g. "what are the
> files that were looked at yesterday?").
> 
> If you grep for noatime in our current codebase, you'd notice that
> we use it only for files in objects/ hierarchy, and that makes very
> good sense.  These files are what we create for our _sole_ use and
> no other tools can peek at them and expect to get any useful
> information out of them (we hear from time to time that virus
> scanners leaving open file descriptors on them causing trouble, but
> that is an example of a useless access), and that makes a file in
> objects/ hierarchy a fair game for noatime optimization.

How do we deal with read-cache:ce_compare_data, though?

By your definition above we shouldn't use NOATIME since the read file
is not in objects/. However, the file read is not something the user
explicitly triggers. The read is part of the Git internal "clean"
machinery.

What would you suggest? Should I open the file with or without NOATIME?

Thanks,
Lars


Re: [PATCH v1 2/2] read-cache: make sure file handles are not inherited by child processes

2016-09-07 Thread Junio C Hamano
Eric Wong  writes:

> We probably should be using O_NOATIME for all O_RDONLY cases
> to get the last bit of performance out (especially since
> non-modern-Linux systems probably still lack relatime).

No, please do not go there.

The user can read from a file in a working tree using "less",
"grep", etc., and they all update the atime, so should "git grep".
We do not use atime ourselves on these files but we should let
outside tools rely on the validity of atime (e.g. "what are the
files that were looked at yesterday?").

If you grep for noatime in our current codebase, you'd notice that
we use it only for files in objects/ hierarchy, and that makes very
good sense.  These files are what we create for our _sole_ use and
no other tools can peek at them and expect to get any useful
information out of them (we hear from time to time that virus
scanners leaving open file descriptors on them causing trouble, but
that is an example of a useless access), and that makes a file in
objects/ hierarchy a fair game for noatime optimization.


Re: [PATCH v1 2/2] read-cache: make sure file handles are not inherited by child processes

2016-09-07 Thread Eric Wong
Lars Schneider  wrote:
> > On 06 Sep 2016, at 23:06, Eric Wong  wrote:
> > larsxschnei...@gmail.com wrote:
> >> static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
> >> {
> >>int match = -1;
> >> -  int fd = open(ce->name, O_RDONLY);
> >> +  int fd = open(ce->name, O_RDONLY | O_CLOEXEC);
> >> 
> >>if (fd >= 0) {
> >>unsigned char sha1[20];
> > 
> > Also, this needs to check EINVAL when O_CLOEXEC != 0 the same
> > way create_tempfile currently does.  Somebody could be building
> > with modern headers but running an old kernel that doesn't
> > understand O_CLOEXEC.
> > 
> > There should probably be a open() wrapper for handling this case
> > since we're now up to 3 places where open(... O_CLOEXEC) is
> > used.
> 
> Right! Actually "sha1_file.c:git_open_noatime()" is already a wrapper, no?
> Can't we use this here? The O_NOATIME flag shouldn't hurt, right?

For ce_compare_data (and other O_RDONLY users), I guess
git_open_noatime is fine; and probably preferable because of
O_NOATIME.

We probably should be using O_NOATIME for all O_RDONLY cases
to get the last bit of performance out (especially since
non-modern-Linux systems probably still lack relatime).

However, create_tempfile needs O_RDWR|O_CREAT|O_EXCL
but I guess we can clean that up in another series.


Re: [PATCH v1 2/2] read-cache: make sure file handles are not inherited by child processes

2016-09-07 Thread Lars Schneider

> On 06 Sep 2016, at 23:06, Eric Wong  wrote:
> 
> larsxschnei...@gmail.com wrote:
>> static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
>> {
>>  int match = -1;
>> -int fd = open(ce->name, O_RDONLY);
>> +int fd = open(ce->name, O_RDONLY | O_CLOEXEC);
>> 
>>  if (fd >= 0) {
>>  unsigned char sha1[20];
> 
> Also, this needs to check EINVAL when O_CLOEXEC != 0 the same
> way create_tempfile currently does.  Somebody could be building
> with modern headers but running an old kernel that doesn't
> understand O_CLOEXEC.
> 
> There should probably be a open() wrapper for handling this case
> since we're now up to 3 places where open(... O_CLOEXEC) is
> used.

Right! Actually "sha1_file.c:git_open_noatime()" is already a wrapper, no?
Can't we use this here? The O_NOATIME flag shouldn't hurt, right?

Thanks,
Lars



Re: [PATCH v1 2/2] read-cache: make sure file handles are not inherited by child processes

2016-09-06 Thread Eric Wong
larsxschnei...@gmail.com wrote:
>  static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
>  {
>   int match = -1;
> - int fd = open(ce->name, O_RDONLY);
> + int fd = open(ce->name, O_RDONLY | O_CLOEXEC);
>  
>   if (fd >= 0) {
>   unsigned char sha1[20];

Also, this needs to check EINVAL when O_CLOEXEC != 0 the same
way create_tempfile currently does.  Somebody could be building
with modern headers but running an old kernel that doesn't
understand O_CLOEXEC.

There should probably be a open() wrapper for handling this case
since we're now up to 3 places where open(... O_CLOEXEC) is
used.


Re: [PATCH v1 2/2] read-cache: make sure file handles are not inherited by child processes

2016-09-06 Thread Johannes Schindelin
Hi Lars,

On Mon, 5 Sep 2016, larsxschnei...@gmail.com wrote:

> [... commit message ...]

Makes sense.

> diff --git a/read-cache.c b/read-cache.c
> index 491e52d..02f74d3 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -156,7 +156,7 @@ void fill_stat_cache_info(struct cache_entry *ce, struct 
> stat *st)
>  static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
>  {
>   int match = -1;
> - int fd = open(ce->name, O_RDONLY);
> + int fd = open(ce->name, O_RDONLY | O_CLOEXEC);

Eric's comment on 1/2 applies here, too, of course: should this cause any
problems on non-Windows platforms, we always have that FD_CLOEXEC thing
that we could probably use to fix it.

But let's cross that bridge when (or better: if) we get there.

Ciao,
Dscho


[PATCH v1 2/2] read-cache: make sure file handles are not inherited by child processes

2016-09-05 Thread larsxschneider
From: Lars Schneider 

This fix prepares a series with the goal to avoid launching a new
clean/smudge filter process for each file that is filtered. A new
long running filter process is introduced that is used to filter all
files in a single Git invocation.

Consider the case of a file that requires filtering and is present in
branch A but not in branch B. If A is the current HEAD and we checkout B
then the following happens:

1. ce_compare_data() opens the file
2.   index_fd() detects that the file requires to run a clean filter and
 calls index_stream_convert_blob()
4. index_stream_convert_blob() calls convert_to_git_filter_fd()
5.   convert_to_git_filter_fd() calls apply_filter() which creates a
 new long running filter process (in case it is the first file
 of this kind to be filtered)
6.   The new filter process inherits all file handles. This is the
 default on Linux/OSX and is explicitly defined in the
 `CreateProcessW` call in `mingw.c` on Windows.
7. ce_compare_data() closes the file
8. Git unlinks the file as it is not present in B

The unlink operation does not work on Windows because the filter process
has still an open handle to the file. On Linux/OSX the unlink operation
succeeds but the file descriptors still leak into the child process.

Fix this problem by opening files in read-cache with the CLOEXEC flag to
ensure that the file descriptor does not remain open in a newly spawned
process similar to 05d1ed61.

Signed-off-by: Lars Schneider 
---
 read-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/read-cache.c b/read-cache.c
index 491e52d..02f74d3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -156,7 +156,7 @@ void fill_stat_cache_info(struct cache_entry *ce, struct 
stat *st)
 static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
 {
int match = -1;
-   int fd = open(ce->name, O_RDONLY);
+   int fd = open(ce->name, O_RDONLY | O_CLOEXEC);
 
if (fd >= 0) {
unsigned char sha1[20];
-- 
2.10.0