Re: [PATCH] sha1_file: use access(), not lstat(), if possible

2017-07-25 Thread Johannes Schindelin
Hi,

On Sat, 22 Jul 2017, Junio C Hamano wrote:

> Johannes Schindelin  writes:
> 
> > But this whole thread taps into a gripe I have with parts of Git's code
> > base: part of the code is not clear at all in its intent by virtue of
> > calling whatever POSIX function may seem to give the answer for the
> > intended question, instead of implementing a function whose name says
> > precisely what question is asked.
> >
> > In this instance, we do not call a helper get_file_size(). Oh no. That
> > would make it too obvious. We call lstat() instead.
> 
> I agree with you for this case and a case like this in general.  
> 
> In codepaths at a lot lower level (they tend to be the ancient and
> quite fundamental ones) in our codebase, lstat() is often directly
> used by the caller because they are interested not only in a single
> aspect of a path but many fields in struct stat are of interest.
> 
> When the code is interested in existence or size or whatever single
> aspect of a path and nothing else, however, the code would become
> easier to read if a helper function with a more specific name is
> used.  And it may even help individual platforms that do not want to
> use the full lstat() emulation, by telling them that other fields in
> struct stat are not needed.
> 
> Of course, then the issue becomes what to do when we are interested
> in not just one but a selected few attributes.  Perhaps we create a
> helper "get_A_B_and_C_attributes_for_path()", which may use lstat()
> on POSIX and the most efficient way to get only A, B and C attributes
> on non-POSIX platforms.  The implementation would be OK, but the naming
> becomes a bit hard; we need to give it a good name.
> 
> Things gets even more interesting when the set of attributes we are
> interested in grows by one and we need to rename the function to
> "get_A_B_C_and_D_attributes_for_path()".  When it is a lot easier to
> fall back to the full lstat() emulation on non-POSIX platforms, the
> temptation to just use it even though it would grab attributes that
> are not needed in that function grows, which needs to be resisted by
> those who are doing the actual implementation for a particular platform.

It becomes a lot easier to fall back to lstat(), if a lot less readable,
yes.

Until, that is, one realises that the function name does not have to
encode what information is sought. It can be a bit field in a parameter
instead. There are even precendents in Git's own source code for that
rather smart paradigm.

Ciao,
Dscho


Re: [PATCH] sha1_file: use access(), not lstat(), if possible

2017-07-22 Thread Junio C Hamano
Johannes Schindelin  writes:

> But this whole thread taps into a gripe I have with parts of Git's code
> base: part of the code is not clear at all in its intent by virtue of
> calling whatever POSIX function may seem to give the answer for the
> intended question, instead of implementing a function whose name says
> precisely what question is asked.
>
> In this instance, we do not call a helper get_file_size(). Oh no. That
> would make it too obvious. We call lstat() instead.

I agree with you for this case and a case like this in general.  

In codepaths at a lot lower level (they tend to be the ancient and
quite fundamental ones) in our codebase, lstat() is often directly
used by the caller because they are interested not only in a single
aspect of a path but many fields in struct stat are of interest.

When the code is interested in existence or size or whatever single
aspect of a path and nothing else, however, the code would become
easier to read if a helper function with a more specific name is
used.  And it may even help individual platforms that do not want to
use the full lstat() emulation, by telling them that other fields in
struct stat are not needed.

Of course, then the issue becomes what to do when we are interested
in not just one but a selected few attributes.  Perhaps we create a
helper "get_A_B_and_C_attributes_for_path()", which may use lstat()
on POSIX and the most efficient way to get only A, B and C attributes
on non-POSIX platforms.  The implementation would be OK, but the naming
becomes a bit hard; we need to give it a good name.

Things gets even more interesting when the set of attributes we are
interested in grows by one and we need to rename the function to
"get_A_B_C_and_D_attributes_for_path()".  When it is a lot easier to
fall back to the full lstat() emulation on non-POSIX platforms, the
temptation to just use it even though it would grab attributes that
are not needed in that function grows, which needs to be resisted by
those who are doing the actual implementation for a particular platform.


Re: [PATCH] sha1_file: use access(), not lstat(), if possible

2017-07-22 Thread Johannes Schindelin
Hi,

On Thu, 20 Jul 2017, Junio C Hamano wrote:

> Jonathan Tan  writes:
> 
> > In sha1_loose_object_info(), use access() (indirectly invoked through
> > has_loose_object()) instead of lstat() if we do not need the on-disk
> > size, as it should be faster on Windows [1].
> 
> That sounds as if Windows is the only thing that matters.  "It is
> faster in general, and is much faster on Windows" would have been
> more convincing, and "It isn't slower, and is much faster on
> Windows" would also have been OK.  Do we have any measurement, or
> this patch does not yield any measuable gain?  
> 
> By the way, the special casing of disk_sizep (which is only used by
> the batch-check feature of cat-file) is somewhat annoying with or
> without this patch, but this change makes it even more so by adding
> an extra indentation level.  I do not think of a way to make it less
> annoying offhand, and I do not think this change needs to address it
> in any way, but I am mentioning this as a hint to bystanders who may
> want to find something small that can be cleaned up ;-)

I actually found a separate piece of information in the meantime:

https://blogs.msdn.microsoft.com/oldnewthing/20071023-00/?p=24713#comment-562083

i.e. _waccess() is implemented in the same way our mingw_lstat()
implementation is: by calling the very same GetFileAttributes() code path.
So it has exactly the same performance characteristics, and I was wrong.

But this whole thread taps into a gripe I have with parts of Git's code
base: part of the code is not clear at all in its intent by virtue of
calling whatever POSIX function may seem to give the answer for the
intended question, instead of implementing a function whose name says
precisely what question is asked.

In this instance, we do not call a helper get_file_size(). Oh no. That
would make it too obvious. We call lstat() instead -- under the assumption
that the whole world runs on Linux, really, because let's be honest about
it: lstat() implementations all differ in subtle ways and we really only
test on Linux.

The obviousness of something like get_file_size() would be so refreshing
to these tired eyes.

Oh, and it would make it much easier to maintain ports to other Operating
Systems, most notably Windows.

Ciao,
Dscho


Re: [PATCH] sha1_file: use access(), not lstat(), if possible

2017-07-20 Thread Junio C Hamano
Jonathan Tan  writes:

> In sha1_loose_object_info(), use access() (indirectly invoked through
> has_loose_object()) instead of lstat() if we do not need the on-disk
> size, as it should be faster on Windows [1].

That sounds as if Windows is the only thing that matters.  "It is
faster in general, and is much faster on Windows" would have been
more convincing, and "It isn't slower, and is much faster on
Windows" would also have been OK.  Do we have any measurement, or
this patch does not yield any measuable gain?  

By the way, the special casing of disk_sizep (which is only used by
the batch-check feature of cat-file) is somewhat annoying with or
without this patch, but this change makes it even more so by adding
an extra indentation level.  I do not think of a way to make it less
annoying offhand, and I do not think this change needs to address it
in any way, but I am mentioning this as a hint to bystanders who may
want to find something small that can be cleaned up ;-)

Thanks.

>
> [1] 
> https://public-inbox.org/git/alpine.DEB.2.21.1.1707191450570.4193@virtualbox/
>
> Signed-off-by: Jonathan Tan 
> ---
> Thanks for the information - here's a patch. Do you, by any chance, know
> of a web page (or similar thing) that I can cite for this?
> ---
>  sha1_file.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/sha1_file.c b/sha1_file.c
> index fca165f13..81962b019 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2920,20 +2920,19 @@ static int sha1_loose_object_info(const unsigned char 
> *sha1,
>  
>   /*
>* If we don't care about type or size, then we don't
> -  * need to look inside the object at all. Note that we
> -  * do not optimize out the stat call, even if the
> -  * caller doesn't care about the disk-size, since our
> -  * return value implicitly indicates whether the
> -  * object even exists.
> +  * need to look inside the object at all. We only check
> +  * for its existence.
>*/
>   if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
> - const char *path;
> - struct stat st;
> - if (stat_sha1_file(sha1, , ) < 0)
> - return -1;
> - if (oi->disk_sizep)
> + if (oi->disk_sizep) {
> + const char *path;
> + struct stat st;
> + if (stat_sha1_file(sha1, , ) < 0)
> + return -1;
>   *oi->disk_sizep = st.st_size;
> - return 0;
> + return 0;
> + }
> + return has_loose_object(sha1) ? 0 : -1;
>   }
>  
>   map = map_sha1_file(sha1, );


[PATCH] sha1_file: use access(), not lstat(), if possible

2017-07-19 Thread Jonathan Tan
In sha1_loose_object_info(), use access() (indirectly invoked through
has_loose_object()) instead of lstat() if we do not need the on-disk
size, as it should be faster on Windows [1].

[1] 
https://public-inbox.org/git/alpine.DEB.2.21.1.1707191450570.4193@virtualbox/

Signed-off-by: Jonathan Tan 
---
Thanks for the information - here's a patch. Do you, by any chance, know
of a web page (or similar thing) that I can cite for this?
---
 sha1_file.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index fca165f13..81962b019 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2920,20 +2920,19 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
 
/*
 * If we don't care about type or size, then we don't
-* need to look inside the object at all. Note that we
-* do not optimize out the stat call, even if the
-* caller doesn't care about the disk-size, since our
-* return value implicitly indicates whether the
-* object even exists.
+* need to look inside the object at all. We only check
+* for its existence.
 */
if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
-   const char *path;
-   struct stat st;
-   if (stat_sha1_file(sha1, , ) < 0)
-   return -1;
-   if (oi->disk_sizep)
+   if (oi->disk_sizep) {
+   const char *path;
+   struct stat st;
+   if (stat_sha1_file(sha1, , ) < 0)
+   return -1;
*oi->disk_sizep = st.st_size;
-   return 0;
+   return 0;
+   }
+   return has_loose_object(sha1) ? 0 : -1;
}
 
map = map_sha1_file(sha1, );
-- 
2.14.0.rc0.284.gd933b75aa4-goog