Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-05 Thread Jeff Hostetler




On 8/3/2018 2:53 PM, Jeff King wrote:

On Fri, Aug 03, 2018 at 02:23:17PM -0400, Jeff Hostetler wrote:


Maybe. It might not work as ino_t. Or it might be expensive to get.  Or
maybe it's simply impossible. I don't know much about Windows. Some
searching implies that NTFS does have a "file index" concept which is
supposed to be unique.


This is hard and/or expensive on Windows.  Yes, you can get the
"file index" values for an open file handle with a cost similar to
an fstat().  Unfortunately, the FindFirst/FindNext routines (equivalent
to the opendir/readdir routines), don't give you that data.  So we'd
have to scan the directory and then open and stat each file.  This is
terribly expensive on Windows -- and the reason we have the fscache
layer (in the GfW version) to intercept the lstat() calls whenever
possible.


I think that high cost might be OK for our purposes here. This code
would _only_ kick in during a clone, and then only on the error path
once we knew we had a collision during the checkout step.



Good point.

I've confirmed that the "file index" values can be used to determine
whether 2 path names are equivalent under NTFS for case variation,
final-dot/space, and short-names vs long-names.

I ran out of time this morning to search the directory for equivalent 
paths.  I'll look at that shortly.


Jeff


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-03 Thread Jeff King
On Fri, Aug 03, 2018 at 02:23:17PM -0400, Jeff Hostetler wrote:

> > Maybe. It might not work as ino_t. Or it might be expensive to get.  Or
> > maybe it's simply impossible. I don't know much about Windows. Some
> > searching implies that NTFS does have a "file index" concept which is
> > supposed to be unique.
> 
> This is hard and/or expensive on Windows.  Yes, you can get the
> "file index" values for an open file handle with a cost similar to
> an fstat().  Unfortunately, the FindFirst/FindNext routines (equivalent
> to the opendir/readdir routines), don't give you that data.  So we'd
> have to scan the directory and then open and stat each file.  This is
> terribly expensive on Windows -- and the reason we have the fscache
> layer (in the GfW version) to intercept the lstat() calls whenever
> possible.

I think that high cost might be OK for our purposes here. This code
would _only_ kick in during a clone, and then only on the error path
once we knew we had a collision during the checkout step.

> Another thing to keep in mind is that the collision could be because
> of case folding (or other such nonsense) on a directory in the path.
> I mean, if someone on Linux builds a commit containing:
> 
> a/b/c/D/e/foo.txt
> a/b/c/d/e/foo.txt
> 
> we'll get a similar collision as if one of them were spelled "FOO.txt".

True, though I think that may be OK. If you had conflicting directories
you'd get a _ton_ of duplicates listed, but that makes sense: you
actually have a ton of duplicates.

> Also, do we need to worry about hard-links or symlinks here?

I think we can ignore hardlinks. Git never creates them, and we know the
directory was empty when we started. Symlinks should be handled by using
lstat(). (Obviously that's for a Unix-ish platform).

> I'm sure there are other edge cases here that make reporting
> difficult; these are just a few I thought of.  I guess what I'm
> trying to say is that as a first step just report that you found
> a collision -- without trying to identify the set existing objects
> that it collided with.

I certainly don't disagree with that. :)

> > At any rate, until we have an actual plan for Windows, I think it would
> > make sense only to split the cases into "has working inodes" and
> > "other", and make sure "other" does something sensible in the meantime
> > (like mention the conflict, but skip trying to list duplicates).
> 
> Yes, this should be split.  Do the "easy" Linux version first.
> Keep in mind that there may also be a different solution for the Mac.

I assumed that an inode-based solution would work for Mac, since it's
mostly BSD under the hood. There may be subtleties I don't know about,
though.

-Peff


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-03 Thread Junio C Hamano
Jeff Hostetler  writes:

> Another thing to keep in mind is that the collision could be because
> of case folding (or other such nonsense) on a directory in the path.
> I mean, if someone on Linux builds a commit containing:
>
> a/b/c/D/e/foo.txt
> a/b/c/d/e/foo.txt
>
> we'll get a similar collision as if one of them were spelled "FOO.txt".

I'd think the approach to teach checkout_entry() codepath to notice
it needed to unlink the existing file in order to check out the
entry it wanted to check out would cover this equally well.

> Also, do we need to worry about hard-links or symlinks here?

I do not think so.  You do not get a file with multiple hardlinks
in a "git clone" or "git checkout" result, and we do not check
things out beyond a symbolic link in the first place.

> If checkout populates symlinks, then you might have another collision
> opportunity.  For example:
>
> a/b/c/D/e/foo.txt
> a/link -> ./b/c/d
> a/link/e/foo.txt

In other words, a tree with a/link (symlink) and a/link/
that requires a/link to be a symlink and a directory at the same
time cannot be created, so you won't get one with "git clone"

> Also, some platforms (like the Mac) allow directory hard-links.
> Granted, Git doesn't create hard-links during checkout, but the
> user might.

And we'd report "we are doing a fresh checkout immediately after a
clone and saw some file we haven't created, which may indicate a
case smashing filesystem glitch (or a competing third-party process
creating random files)", so noticing that would be a good thing, I
would think.

> I'm sure there are other edge cases here that make reporting
> difficult; these are just a few I thought of.  I guess what I'm
> trying to say is that as a first step just report that you found
> a collision -- without trying to identify the set existing objects
> that it collided with.

Yup, I think that is sensible.  If it can be done cheaply, i.e. on a
filesystem with trustable and cheap inum, after noticing such a
collision, go back and lstat() all paths in the index we have
checked out so far to see which ones are colliding, it adds useful
clue to the report, but noticing the collision in the first place
obviously has more value.


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-03 Thread Jeff Hostetler




On 8/2/2018 5:28 PM, Jeff King wrote:

On Thu, Aug 02, 2018 at 02:14:30PM -0700, Junio C Hamano wrote:


Jeff King  writes:


I also wonder if Windows could return some other file-unique identifier
that would work in place of an inode here. That would be pretty easy to
swap in via an #ifdef's helper function. I'd be OK shipping without that
and letting Windows folks fill it in later (as long as we do not do
anything too stupid until then, like claim all of the inode==0 files are
the same).


Yeah, but such a useful file-unique identifier would probably be
used in place of inum in their (l)stat emulation already, if exists,
no?


Maybe. It might not work as ino_t. Or it might be expensive to get.  Or
maybe it's simply impossible. I don't know much about Windows. Some
searching implies that NTFS does have a "file index" concept which is
supposed to be unique.


This is hard and/or expensive on Windows.  Yes, you can get the
"file index" values for an open file handle with a cost similar to
an fstat().  Unfortunately, the FindFirst/FindNext routines (equivalent
to the opendir/readdir routines), don't give you that data.  So we'd
have to scan the directory and then open and stat each file.  This is
terribly expensive on Windows -- and the reason we have the fscache
layer (in the GfW version) to intercept the lstat() calls whenever
possible.

It might be possible to use the NTFS Master File Table to discover
this (very big handwave), but I would need to do a little digging.

This would all be NTFS specific.  FAT and other volume types would not
be covered.

Another thing to keep in mind is that the collision could be because
of case folding (or other such nonsense) on a directory in the path.
I mean, if someone on Linux builds a commit containing:

a/b/c/D/e/foo.txt
a/b/c/d/e/foo.txt

we'll get a similar collision as if one of them were spelled "FOO.txt".

Also, do we need to worry about hard-links or symlinks here?
If checkout populates symlinks, then you might have another collision
opportunity.  For example:

a/b/c/D/e/foo.txt
a/link -> ./b/c/d
a/link/e/foo.txt

Also, some platforms (like the Mac) allow directory hard-links.
Granted, Git doesn't create hard-links during checkout, but the
user might.

I'm sure there are other edge cases here that make reporting
difficult; these are just a few I thought of.  I guess what I'm
trying to say is that as a first step just report that you found
a collision -- without trying to identify the set existing objects
that it collided with.



At any rate, until we have an actual plan for Windows, I think it would
make sense only to split the cases into "has working inodes" and
"other", and make sure "other" does something sensible in the meantime
(like mention the conflict, but skip trying to list duplicates).


Yes, this should be split.  Do the "easy" Linux version first.
Keep in mind that there may also be a different solution for the Mac.


When somebody wants to work on Windows support, then we can figure out
if it just needs to wrap the "get unique identifier" operation, or if it
would use a totally different algorithm.

-Peff



Jeff


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-03 Thread Torsten Bögershausen
On 2018-08-02 15:43, Duy Nguyen wrote:
> On Wed, Aug 1, 2018 at 11:20 PM Junio C Hamano  wrote:
>>
>> Junio C Hamano  writes:
>>
>>> Jeff King  writes:
>>>
> Presumably we are already in an error codepath, so if it is
> absolutely necessary, then we can issue a lstat() to grab the inum
> for the path we are about to create, iterate over the previously
> checked out paths issuing lstat() and see which one yields the same
> inum, to find the one who is the culprit.

 Yes, this is the cleverness I was missing in my earlier response.

 So it seems do-able, and I like that this incurs no cost in the
 non-error case.
>>>
>>> Not so fast, unfortunately.
>>>
>>> I suspect that some filesystems do not give us inum that we can use
>>> for that "identity" purpose, and they tend to be the ones with the
>>> case smashing characteristics where we need this code in the error
>>> path the most X-<.
>>
>> But even if inum is unreliable, we should be able to use other
>> clues, perhaps the same set of fields we use for cached stat
>> matching optimization we use for "diff" plumbing commands, to
>> implement the error report.  The more important part of the idea is
>> that we already need to notice that we need to remove a path that is
>> in the working tree while doing the checkout, so the alternative
>> approach won't incur any extra cost for normal cases where the
>> project being checked out does not have two files whose pathnames
>> are only different in case (or checking out such an offending
>> project to a case sensitive filesytem, of course).
>>
>> So I guess it still _is_ workable.  Any takers?
> 
> OK so we're going back to the original way of checking that we check
> out the different files on the same place (because fs is icase) and
> try to collect all paths for reporting, yes?

I would say: Yes.

> I can give it another go
> (but of course if anybody else steps up, I'd very gladly hand this
> over)
> 

Not at the moment.




Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-02 Thread Jeff King
On Thu, Aug 02, 2018 at 02:14:30PM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > I also wonder if Windows could return some other file-unique identifier
> > that would work in place of an inode here. That would be pretty easy to
> > swap in via an #ifdef's helper function. I'd be OK shipping without that
> > and letting Windows folks fill it in later (as long as we do not do
> > anything too stupid until then, like claim all of the inode==0 files are
> > the same).
> 
> Yeah, but such a useful file-unique identifier would probably be
> used in place of inum in their (l)stat emulation already, if exists,
> no?

Maybe. It might not work as ino_t. Or it might be expensive to get.  Or
maybe it's simply impossible. I don't know much about Windows. Some
searching implies that NTFS does have a "file index" concept which is
supposed to be unique.

At any rate, until we have an actual plan for Windows, I think it would
make sense only to split the cases into "has working inodes" and
"other", and make sure "other" does something sensible in the meantime
(like mention the conflict, but skip trying to list duplicates).

When somebody wants to work on Windows support, then we can figure out
if it just needs to wrap the "get unique identifier" operation, or if it
would use a totally different algorithm.

-Peff


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-02 Thread Junio C Hamano
Jeff King  writes:

> I also wonder if Windows could return some other file-unique identifier
> that would work in place of an inode here. That would be pretty easy to
> swap in via an #ifdef's helper function. I'd be OK shipping without that
> and letting Windows folks fill it in later (as long as we do not do
> anything too stupid until then, like claim all of the inode==0 files are
> the same).

Yeah, but such a useful file-unique identifier would probably be
used in place of inum in their (l)stat emulation already, if exists,
no?

> PS It occurs to me that doing this naively (re-scan the entries already
>checked out when we see a collision) ends up quadratic over the
>number of entries in the worst case. That may not matter. You'd only
>have a handful of collisions normally, and anybody malicious can
>already git-bomb your checkout anyway. If we care, an alternative
>would be to set a flag for "I saw some collisions", and then follow
>up with a single pass putting entries into a hashmap of
>inode->filename.

Yeah, that makes sense.


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-02 Thread Jeff King
On Thu, Aug 02, 2018 at 09:27:31AM -0700, Junio C Hamano wrote:

> > OK so we're going back to the original way of checking that we check
> > out the different files on the same place (because fs is icase) and
> > try to collect all paths for reporting, yes? I can give it another go
> > (but of course if anybody else steps up, I'd very gladly hand this
> > over)
> 
> Detect and report, definitely yes; I am not sure about collect all
> (personally I am OK if we stopped at reporting "I tried to check out
> X but your project tree has something else that is turned to X by
> your pathname-smashing filesystem" without making it a requirement
> to report what the other one that conflict with X is.  Of course,
> reporting the other side _is_ nicer and I'd be happier if we can do
> so without too much ugly code, but I do not think it is a hard
> requirement.

Yeah, I think it would be OK to issue the warning for the conflicted
path, and then if we _can_ produce the secondary list of colliding
paths, do so. Even on a system with working inodes, we may not come up
with a match (e.g., if it wasn't us who wrote the file, but rather we
raced with some other process).

I also wonder if Windows could return some other file-unique identifier
that would work in place of an inode here. That would be pretty easy to
swap in via an #ifdef's helper function. I'd be OK shipping without that
and letting Windows folks fill it in later (as long as we do not do
anything too stupid until then, like claim all of the inode==0 files are
the same).

-Peff

PS It occurs to me that doing this naively (re-scan the entries already
   checked out when we see a collision) ends up quadratic over the
   number of entries in the worst case. That may not matter. You'd only
   have a handful of collisions normally, and anybody malicious can
   already git-bomb your checkout anyway. If we care, an alternative
   would be to set a flag for "I saw some collisions", and then follow
   up with a single pass putting entries into a hashmap of
   inode->filename.


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-02 Thread Junio C Hamano
Duy Nguyen  writes:

>> But even if inum is unreliable, we should be able to use other
>> clues, perhaps the same set of fields we use for cached stat
>> matching optimization we use for "diff" plumbing commands, to
>> implement the error report.  The more important part of the idea is
>> that we already need to notice that we need to remove a path that is
>> in the working tree while doing the checkout, so the alternative
>> approach won't incur any extra cost for normal cases where the
>> project being checked out does not have two files whose pathnames
>> are only different in case (or checking out such an offending
>> project to a case sensitive filesytem, of course).
>>
>> So I guess it still _is_ workable.  Any takers?
>
> OK so we're going back to the original way of checking that we check
> out the different files on the same place (because fs is icase) and
> try to collect all paths for reporting, yes? I can give it another go
> (but of course if anybody else steps up, I'd very gladly hand this
> over)

Detect and report, definitely yes; I am not sure about collect all
(personally I am OK if we stopped at reporting "I tried to check out
X but your project tree has something else that is turned to X by
your pathname-smashing filesystem" without making it a requirement
to report what the other one that conflict with X is.  Of course,
reporting the other side _is_ nicer and I'd be happier if we can do
so without too much ugly code, but I do not think it is a hard
requirement.

Thanks.


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-02 Thread Duy Nguyen
On Wed, Aug 1, 2018 at 11:20 PM Junio C Hamano  wrote:
>
> Junio C Hamano  writes:
>
> > Jeff King  writes:
> >
> >>> Presumably we are already in an error codepath, so if it is
> >>> absolutely necessary, then we can issue a lstat() to grab the inum
> >>> for the path we are about to create, iterate over the previously
> >>> checked out paths issuing lstat() and see which one yields the same
> >>> inum, to find the one who is the culprit.
> >>
> >> Yes, this is the cleverness I was missing in my earlier response.
> >>
> >> So it seems do-able, and I like that this incurs no cost in the
> >> non-error case.
> >
> > Not so fast, unfortunately.
> >
> > I suspect that some filesystems do not give us inum that we can use
> > for that "identity" purpose, and they tend to be the ones with the
> > case smashing characteristics where we need this code in the error
> > path the most X-<.
>
> But even if inum is unreliable, we should be able to use other
> clues, perhaps the same set of fields we use for cached stat
> matching optimization we use for "diff" plumbing commands, to
> implement the error report.  The more important part of the idea is
> that we already need to notice that we need to remove a path that is
> in the working tree while doing the checkout, so the alternative
> approach won't incur any extra cost for normal cases where the
> project being checked out does not have two files whose pathnames
> are only different in case (or checking out such an offending
> project to a case sensitive filesytem, of course).
>
> So I guess it still _is_ workable.  Any takers?

OK so we're going back to the original way of checking that we check
out the different files on the same place (because fs is icase) and
try to collect all paths for reporting, yes? I can give it another go
(but of course if anybody else steps up, I'd very gladly hand this
over)
-- 
Duy


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-01 Thread Junio C Hamano
Junio C Hamano  writes:

> Jeff King  writes:
>
>>> Presumably we are already in an error codepath, so if it is
>>> absolutely necessary, then we can issue a lstat() to grab the inum
>>> for the path we are about to create, iterate over the previously
>>> checked out paths issuing lstat() and see which one yields the same
>>> inum, to find the one who is the culprit.
>>
>> Yes, this is the cleverness I was missing in my earlier response.
>>
>> So it seems do-able, and I like that this incurs no cost in the
>> non-error case.
>
> Not so fast, unfortunately.  
>
> I suspect that some filesystems do not give us inum that we can use
> for that "identity" purpose, and they tend to be the ones with the
> case smashing characteristics where we need this code in the error
> path the most X-<.

But even if inum is unreliable, we should be able to use other
clues, perhaps the same set of fields we use for cached stat
matching optimization we use for "diff" plumbing commands, to
implement the error report.  The more important part of the idea is
that we already need to notice that we need to remove a path that is
in the working tree while doing the checkout, so the alternative
approach won't incur any extra cost for normal cases where the
project being checked out does not have two files whose pathnames
are only different in case (or checking out such an offending
project to a case sensitive filesytem, of course).

So I guess it still _is_ workable.  Any takers?


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-01 Thread Duy Nguyen
On Tue, Jul 31, 2018 at 8:23 PM Torsten Bögershausen  wrote:
> I wonder if we can tell the users more about the "problems"
> and how to avoid them, or to live with them.
>
> This is more loud thinking:
>
> "The following paths only differ in case\n"
> "One a case-insensitive file system only one at a time can be present\n"
> "You may rename one like this:\n"
> "git checkout  && git mv  .1\n"

Jeff gave a couple more options [1] to fix or workaround this. I think
the problem is there is no single recommended way to deal with it. If
there is, we can describe in this warning. But listing multiple
options in this warning may be too much (the wall of text could easily
take half a screen).

Or if people agree on _one_ suggestion, I will gladly put it in.

[1] 
https://public-inbox.org/git/cacsjy8a_uzm7numyernhjmya0eyrqytv7dp2iklznxnboqu...@mail.gmail.com/T/#m60fedd7dc928a4d52eb5919811f84556f391a7b3

> > + fprintf(stderr, "\t%s\n", 
> > dup.items[i].string);
>
> Another question:
> Do we need any quote_path() here ?
> (This may be overkill, since typically the repos with conflicting names
> only use ASCII.)

Would be good to show trailing spaces in path names, so yes.
-- 
Duy


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-01 Thread Duy Nguyen
On Tue, Jul 31, 2018 at 8:44 PM Elijah Newren  wrote:
> Is it worth attempting to also warn about paths that only differ in
> UTF-normalization on relevant MacOS systems?

Down this thread, Jeff Hostetler drew a scarier picture of "case"
handling on MacOS and Windows. I think we should start with support
things like utf normalization... in core.ignore first before doing the
warning stuff. At that point we know well how to detect and warn, or
any other pitfalls.
-- 
Duy


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-08-01 Thread Duy Nguyen
On Tue, Jul 31, 2018 at 9:13 PM Junio C Hamano  wrote:
>
> Nguyễn Thái Ngọc Duy   writes:
>
> > Another thing we probably should do is catch in "git checkout" too,
> > not just "git clone" since your linux/unix colleage colleague may
> > accidentally add some files that your mac/windows machine is not very
> > happy with.
>
> Then you would catch it not in checkout but in add, no?

No because the guy who added these may have done it on a
case-sensitive filesystem. Only later when his friend fetches new
changes on a case-insensitive filesytem, the problem becomes real. If
in this scenario, core.ignore is enforced on all machines, then yes we
could catch it at "git add" (and we should already do that or we have
a bug). At least in open source project setting, I think enforcing
core.ignore will not work.
-- 
Duy


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Junio C Hamano
Jeff King  writes:

>> Presumably we are already in an error codepath, so if it is
>> absolutely necessary, then we can issue a lstat() to grab the inum
>> for the path we are about to create, iterate over the previously
>> checked out paths issuing lstat() and see which one yields the same
>> inum, to find the one who is the culprit.
>
> Yes, this is the cleverness I was missing in my earlier response.
>
> So it seems do-able, and I like that this incurs no cost in the
> non-error case.

Not so fast, unfortunately.  

I suspect that some filesystems do not give us inum that we can use
for that "identity" purpose, and they tend to be the ones with the
case smashing characteristics where we need this code in the error
path the most X-<.


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Jeff King
On Tue, Jul 31, 2018 at 01:12:14PM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > On Tue, Jul 31, 2018 at 12:12:58PM -0700, Junio C Hamano wrote:
> > ...
> >> collapses two (or more) paths if we go that way.  We only need to
> >> report "we tried to check out X but it seems your filesystem equates
> >> something else that is also in the project to X".
> >
> > Heh. See my similar suggestion in:
> >
> >   https://public-inbox.org/git/20180728095659.ga21...@sigill.intra.peff.net/
> >
> > and the response from Duy.
> 
> Yes, but is there a reason why we need to report what that
> "something else" is?

I don't think it's strictly necessary, but it probably makes things
easier for the user. That said...

> Presumably we are already in an error codepath, so if it is
> absolutely necessary, then we can issue a lstat() to grab the inum
> for the path we are about to create, iterate over the previously
> checked out paths issuing lstat() and see which one yields the same
> inum, to find the one who is the culprit.

Yes, this is the cleverness I was missing in my earlier response.

So it seems do-able, and I like that this incurs no cost in the
non-error case.

-Peff


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Junio C Hamano
Jeff King  writes:

> On Tue, Jul 31, 2018 at 12:12:58PM -0700, Junio C Hamano wrote:
> ...
>> collapses two (or more) paths if we go that way.  We only need to
>> report "we tried to check out X but it seems your filesystem equates
>> something else that is also in the project to X".
>
> Heh. See my similar suggestion in:
>
>   https://public-inbox.org/git/20180728095659.ga21...@sigill.intra.peff.net/
>
> and the response from Duy.

Yes, but is there a reason why we need to report what that
"something else" is?

Presumably we are already in an error codepath, so if it is
absolutely necessary, then we can issue a lstat() to grab the inum
for the path we are about to create, iterate over the previously
checked out paths issuing lstat() and see which one yields the same
inum, to find the one who is the culprit.



Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Jeff King
On Tue, Jul 31, 2018 at 12:12:58PM -0700, Junio C Hamano wrote:

> Elijah Newren  writes:
> 
> > Is it worth attempting to also warn about paths that only differ in
> > UTF-normalization on relevant MacOS systems?
> 
> I hate to bring up a totally different approach this late in the
> party, but I wonder if it makes more sense to take advantage of
> "clone" being a command that starts from an empty working tree.
> 
> builtin/clone.c::checkout() drives a single-tree unpack_trees(),
> using oneway_merge() as its callback and at the end, eventually
> unpack_trees.c:check_updates() will call into checkout_entry()
> to perform the usual "unlink and then create" dance.
> 
> I wonder if it makes sense to introduce a new option to tell the
> machinery to report when the final checkout_entry() notices that it
> needed to remove the working tree file to make room (perhaps that
> bit would go in "struct unpack_trees_options").  In the initial
> checkout codepath for a freshly cloned repository, that would only
> happen when your tree has two (or more) paths that gets smashed
> by case insensitive or UTF-normalizing filesystem, and the code we
> would maintain do not have to care how exactly the filesystem
> collapses two (or more) paths if we go that way.  We only need to
> report "we tried to check out X but it seems your filesystem equates
> something else that is also in the project to X".

Heh. See my similar suggestion in:

  https://public-inbox.org/git/20180728095659.ga21...@sigill.intra.peff.net/

and the response from Duy.

-Peff


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Junio C Hamano
Elijah Newren  writes:

> Is it worth attempting to also warn about paths that only differ in
> UTF-normalization on relevant MacOS systems?

I hate to bring up a totally different approach this late in the
party, but I wonder if it makes more sense to take advantage of
"clone" being a command that starts from an empty working tree.

builtin/clone.c::checkout() drives a single-tree unpack_trees(),
using oneway_merge() as its callback and at the end, eventually
unpack_trees.c:check_updates() will call into checkout_entry()
to perform the usual "unlink and then create" dance.

I wonder if it makes sense to introduce a new option to tell the
machinery to report when the final checkout_entry() notices that it
needed to remove the working tree file to make room (perhaps that
bit would go in "struct unpack_trees_options").  In the initial
checkout codepath for a freshly cloned repository, that would only
happen when your tree has two (or more) paths that gets smashed
by case insensitive or UTF-normalizing filesystem, and the code we
would maintain do not have to care how exactly the filesystem
collapses two (or more) paths if we go that way.  We only need to
report "we tried to check out X but it seems your filesystem equates
something else that is also in the project to X".





Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> Another thing we probably should do is catch in "git checkout" too,
> not just "git clone" since your linux/unix colleage colleague may
> accidentally add some files that your mac/windows machine is not very
> happy with.

Then you would catch it not in checkout but in add, no?


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Elijah Newren
On Mon, Jul 30, 2018 at 8:27 AM, Nguyễn Thái Ngọc Duy  wrote:
> Paths that only differ in case work fine in a case-sensitive
> filesystems, but if those repos are cloned in a case-insensitive one,
> you'll get problems. The first thing to notice is "git status" will
> never be clean with no indication what's exactly is "dirty".

"what" rather than "what's"?

> This patch helps the situation a bit by pointing out the problem at
> clone time. I have not suggested any way to work around or fix this
> problem. But I guess we could probably have a section in
> Documentation/ dedicated to this problem and point there instead of
> a long advice in this warning.
>
> Another thing we probably should do is catch in "git checkout" too,
> not just "git clone" since your linux/unix colleage colleague may

drop "colleage", keep "colleague"?

> accidentally add some files that your mac/windows machine is not very
> happy with. But then there's another problem, once the problem is
> known, we probably should stop spamming this warning at every
> checkout, but how?

Good questions.  I have no answers.

>
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  builtin/clone.c | 41 +
>  1 file changed, 41 insertions(+)
>
> diff --git a/builtin/clone.c b/builtin/clone.c
> index 5c439f1394..32738c2737 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -711,6 +711,33 @@ static void update_head(const struct ref *our, const 
> struct ref *remote,
> }
>  }
>
> +static void find_duplicate_icase_entries(struct index_state *istate,
> +struct string_list *dup)
> +{
> +   struct string_list list = STRING_LIST_INIT_NODUP;
> +   int i;
> +
> +   for (i = 0; i < istate->cache_nr; i++)
> +   string_list_append(, istate->cache[i]->name);
> +
> +   list.cmp = fspathcmp;
> +   string_list_sort();

So, you sort the list by fspathcmp to get the entries that differ in
case only next to each other.  Makes sense...

> +
> +   for (i = 1; i < list.nr; i++) {
> +   const char *cur = list.items[i].string;
> +   const char *prev = list.items[i - 1].string;
> +
> +   if (dup->nr &&
> +   !fspathcmp(cur, dup->items[dup->nr - 1].string)) {
> +   string_list_append(dup, cur);

If we have at least one duplicate in dup (and currently we'd have to
have at least two), and we now hit yet another (i.e. a third or
fourth...) way of spelling the same path, then we add it.

> +   } else if (!fspathcmp(cur, prev)) {
> +   string_list_append(dup, prev);
> +   string_list_append(dup, cur);
> +   }

...otherwise, if we find a duplicate, we add both spellings to the dup list.

> +   }
> +   string_list_clear(, 0);
> +}
> +
>  static int checkout(int submodule_progress)
>  {
> struct object_id oid;
> @@ -761,6 +788,20 @@ static int checkout(int submodule_progress)
> if (write_locked_index(_index, _file, COMMIT_LOCK))
> die(_("unable to write new index file"));
>
> +   if (ignore_case) {
> +   struct string_list dup = STRING_LIST_INIT_DUP;
> +   int i;
> +
> +   find_duplicate_icase_entries(_index, );
> +   if (dup.nr) {
> +   warning(_("the following paths in this repository 
> only differ in case and will\n"

Perhaps I'm being excessively pedantic, but what if there are multiple
pairs of paths differing in case?  E.g. if someone has readme.txt,
README.txt, foo.txt, and FOO.txt, you'll list all four files but
readme.txt and foo.txt do not only differ in case.

Maybe something like "...only differ in case from another path and
will... " or is that too verbose and annoying?

> + "cause problems because you have cloned it 
> on an case-insensitive filesytem:\n"));
> +   for (i = 0; i < dup.nr; i++)
> +   fprintf(stderr, "\t%s\n", 
> dup.items[i].string);
> +   }
> +   string_list_clear(, 0);
> +   }
> +
> err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
>oid_to_hex(), "1", NULL);

Is it worth attempting to also warn about paths that only differ in
UTF-normalization on relevant MacOS systems?


Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-31 Thread Torsten Bögershausen
On Mon, Jul 30, 2018 at 05:27:55PM +0200, Nguyễn Thái Ngọc Duy wrote:
> Paths that only differ in case work fine in a case-sensitive
> filesystems, but if those repos are cloned in a case-insensitive one,
> you'll get problems. The first thing to notice is "git status" will
> never be clean with no indication what's exactly is "dirty".
> 
> This patch helps the situation a bit by pointing out the problem at
> clone time. I have not suggested any way to work around or fix this
> problem. But I guess we could probably have a section in
> Documentation/ dedicated to this problem and point there instead of
> a long advice in this warning.
> 
> Another thing we probably should do is catch in "git checkout" too,
> not just "git clone" since your linux/unix colleage colleague may
> accidentally add some files that your mac/windows machine is not very
> happy with. But then there's another problem, once the problem is
> known, we probably should stop spamming this warning at every
> checkout, but how?
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  builtin/clone.c | 41 +
>  1 file changed, 41 insertions(+)
> 
> diff --git a/builtin/clone.c b/builtin/clone.c
> index 5c439f1394..32738c2737 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -711,6 +711,33 @@ static void update_head(const struct ref *our, const 
> struct ref *remote,
>   }
>  }
>  
> +static void find_duplicate_icase_entries(struct index_state *istate,
> +  struct string_list *dup)
> +{
> + struct string_list list = STRING_LIST_INIT_NODUP;
> + int i;
> +
> + for (i = 0; i < istate->cache_nr; i++)
> + string_list_append(, istate->cache[i]->name);
> +
> + list.cmp = fspathcmp;
> + string_list_sort();
> +
> + for (i = 1; i < list.nr; i++) {
> + const char *cur = list.items[i].string;
> + const char *prev = list.items[i - 1].string;
> +
> + if (dup->nr &&
> + !fspathcmp(cur, dup->items[dup->nr - 1].string)) {
> + string_list_append(dup, cur);
> + } else if (!fspathcmp(cur, prev)) {
> + string_list_append(dup, prev);
> + string_list_append(dup, cur);
> + }
> + }
> + string_list_clear(, 0);
> +}
> +
>  static int checkout(int submodule_progress)
>  {
>   struct object_id oid;
> @@ -761,6 +788,20 @@ static int checkout(int submodule_progress)
>   if (write_locked_index(_index, _file, COMMIT_LOCK))
>   die(_("unable to write new index file"));
>  
> + if (ignore_case) {
> + struct string_list dup = STRING_LIST_INIT_DUP;
> + int i;
> +
> + find_duplicate_icase_entries(_index, );
> + if (dup.nr) {
> + warning(_("the following paths in this repository only 
> differ in case and will\n"
> +   "cause problems because you have cloned it on 
> an case-insensitive filesytem:\n"));

Thanks for the patch.
I wonder if we can tell the users more about the "problems"
and how to avoid them, or to live with them.

This is more loud thinking:

"The following paths only differ in case\n"
"One a case-insensitive file system only one at a time can be present\n"
"You may rename one like this:\n"
"git checkout  && git mv  .1\n"

> + fprintf(stderr, "\t%s\n", dup.items[i].string);

Another question:
Do we need any quote_path() here ?
(This may be overkill, since typically the repos with conflicting names
only use ASCII.)

> + }
> + string_list_clear(, 0);
> + }
> +
>   err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
>  oid_to_hex(), "1", NULL);
>  
> -- 
> 2.18.0.656.gda699b98b3
> 


[PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

2018-07-30 Thread Nguyễn Thái Ngọc Duy
Paths that only differ in case work fine in a case-sensitive
filesystems, but if those repos are cloned in a case-insensitive one,
you'll get problems. The first thing to notice is "git status" will
never be clean with no indication what's exactly is "dirty".

This patch helps the situation a bit by pointing out the problem at
clone time. I have not suggested any way to work around or fix this
problem. But I guess we could probably have a section in
Documentation/ dedicated to this problem and point there instead of
a long advice in this warning.

Another thing we probably should do is catch in "git checkout" too,
not just "git clone" since your linux/unix colleage colleague may
accidentally add some files that your mac/windows machine is not very
happy with. But then there's another problem, once the problem is
known, we probably should stop spamming this warning at every
checkout, but how?

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/clone.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index 5c439f1394..32738c2737 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -711,6 +711,33 @@ static void update_head(const struct ref *our, const 
struct ref *remote,
}
 }
 
+static void find_duplicate_icase_entries(struct index_state *istate,
+struct string_list *dup)
+{
+   struct string_list list = STRING_LIST_INIT_NODUP;
+   int i;
+
+   for (i = 0; i < istate->cache_nr; i++)
+   string_list_append(, istate->cache[i]->name);
+
+   list.cmp = fspathcmp;
+   string_list_sort();
+
+   for (i = 1; i < list.nr; i++) {
+   const char *cur = list.items[i].string;
+   const char *prev = list.items[i - 1].string;
+
+   if (dup->nr &&
+   !fspathcmp(cur, dup->items[dup->nr - 1].string)) {
+   string_list_append(dup, cur);
+   } else if (!fspathcmp(cur, prev)) {
+   string_list_append(dup, prev);
+   string_list_append(dup, cur);
+   }
+   }
+   string_list_clear(, 0);
+}
+
 static int checkout(int submodule_progress)
 {
struct object_id oid;
@@ -761,6 +788,20 @@ static int checkout(int submodule_progress)
if (write_locked_index(_index, _file, COMMIT_LOCK))
die(_("unable to write new index file"));
 
+   if (ignore_case) {
+   struct string_list dup = STRING_LIST_INIT_DUP;
+   int i;
+
+   find_duplicate_icase_entries(_index, );
+   if (dup.nr) {
+   warning(_("the following paths in this repository only 
differ in case and will\n"
+ "cause problems because you have cloned it on 
an case-insensitive filesytem:\n"));
+   for (i = 0; i < dup.nr; i++)
+   fprintf(stderr, "\t%s\n", dup.items[i].string);
+   }
+   string_list_clear(, 0);
+   }
+
err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
   oid_to_hex(), "1", NULL);
 
-- 
2.18.0.656.gda699b98b3