Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On 8/3/2018 2:53 PM, Jeff King wrote:
> On Fri, Aug 03, 2018 at 02:23:17PM -0400, Jeff Hostetler wrote:
> > > Maybe. It might not work as ino_t. Or it might be expensive to get. Or
> > > maybe it's simply impossible. I don't know much about Windows. Some
> > > searching implies that NTFS does have a "file index" concept which is
> > > supposed to be unique.
> >
> > This is hard and/or expensive on Windows. Yes, you can get the
> > "file index" values for an open file handle with a cost similar to
> > an fstat(). Unfortunately, the FindFirst/FindNext routines (equivalent
> > to the opendir/readdir routines) don't give you that data. So we'd
> > have to scan the directory and then open and stat each file. This is
> > terribly expensive on Windows -- and the reason we have the fscache
> > layer (in the GfW version) to intercept the lstat() calls whenever
> > possible.
>
> I think that high cost might be OK for our purposes here. This code
> would _only_ kick in during a clone, and then only on the error path
> once we knew we had a collision during the checkout step.

Good point.

I've confirmed that the "file index" values can be used to determine
whether 2 path names are equivalent under NTFS for case variation,
final-dot/space, and short-names vs long-names.

I ran out of time this morning to search the directory for equivalent
paths. I'll look at that shortly.

Jeff
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Fri, Aug 03, 2018 at 02:23:17PM -0400, Jeff Hostetler wrote:

> > Maybe. It might not work as ino_t. Or it might be expensive to get. Or
> > maybe it's simply impossible. I don't know much about Windows. Some
> > searching implies that NTFS does have a "file index" concept which is
> > supposed to be unique.
>
> This is hard and/or expensive on Windows. Yes, you can get the
> "file index" values for an open file handle with a cost similar to
> an fstat(). Unfortunately, the FindFirst/FindNext routines (equivalent
> to the opendir/readdir routines) don't give you that data. So we'd
> have to scan the directory and then open and stat each file. This is
> terribly expensive on Windows -- and the reason we have the fscache
> layer (in the GfW version) to intercept the lstat() calls whenever
> possible.

I think that high cost might be OK for our purposes here. This code
would _only_ kick in during a clone, and then only on the error path
once we knew we had a collision during the checkout step.

> Another thing to keep in mind is that the collision could be because
> of case folding (or other such nonsense) on a directory in the path.
> I mean, if someone on Linux builds a commit containing:
>
>   a/b/c/D/e/foo.txt
>   a/b/c/d/e/foo.txt
>
> we'll get a similar collision as if one of them were spelled "FOO.txt".

True, though I think that may be OK. If you had conflicting directories
you'd get a _ton_ of duplicates listed, but that makes sense: you
actually have a ton of duplicates.

> Also, do we need to worry about hard-links or symlinks here?

I think we can ignore hardlinks. Git never creates them, and we know the
directory was empty when we started. Symlinks should be handled by using
lstat(). (Obviously that's for a Unix-ish platform.)

> I'm sure there are other edge cases here that make reporting
> difficult; these are just a few I thought of. I guess what I'm
> trying to say is that as a first step just report that you found
> a collision -- without trying to identify the set of existing objects
> that it collided with.

I certainly don't disagree with that. :)

> > At any rate, until we have an actual plan for Windows, I think it would
> > make sense only to split the cases into "has working inodes" and
> > "other", and make sure "other" does something sensible in the meantime
> > (like mention the conflict, but skip trying to list duplicates).
>
> Yes, this should be split. Do the "easy" Linux version first.
> Keep in mind that there may also be a different solution for the Mac.

I assumed that an inode-based solution would work for Mac, since it's
mostly BSD under the hood. There may be subtleties I don't know about,
though.

-Peff
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Jeff Hostetler writes:

> Another thing to keep in mind is that the collision could be because
> of case folding (or other such nonsense) on a directory in the path.
> I mean, if someone on Linux builds a commit containing:
>
>   a/b/c/D/e/foo.txt
>   a/b/c/d/e/foo.txt
>
> we'll get a similar collision as if one of them were spelled "FOO.txt".

I'd think the approach to teach the checkout_entry() codepath to notice
that it needed to unlink an existing file in order to check out the
entry it wanted to check out would cover this equally well.

> Also, do we need to worry about hard-links or symlinks here?

I do not think so. You do not get a file with multiple hardlinks in a
"git clone" or "git checkout" result, and we do not check things out
beyond a symbolic link in the first place.

> If checkout populates symlinks, then you might have another collision
> opportunity. For example:
>
>   a/b/c/D/e/foo.txt
>   a/link -> ./b/c/d
>   a/link/e/foo.txt

In other words, a tree with a/link (symlink) and a/link/ that requires
a/link to be a symlink and a directory at the same time cannot be
created, so you won't get one with "git clone".

> Also, some platforms (like the Mac) allow directory hard-links.
> Granted, Git doesn't create hard-links during checkout, but the
> user might.

And we'd report "we are doing a fresh checkout immediately after a
clone and saw some file we haven't created, which may indicate a
case-smashing filesystem glitch (or a competing third-party process
creating random files)", so noticing that would be a good thing, I
would think.

> I'm sure there are other edge cases here that make reporting
> difficult; these are just a few I thought of. I guess what I'm
> trying to say is that as a first step just report that you found
> a collision -- without trying to identify the set of existing objects
> that it collided with.

Yup, I think that is sensible. If it can be done cheaply, i.e. on a
filesystem with a trustable and cheap inum, after noticing such a
collision, go back and lstat() all paths in the index we have checked
out so far to see which ones are colliding, it adds a useful clue to
the report, but noticing the collision in the first place obviously
has more value.
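The lstat()-everything-after-a-collision idea above can be sketched outside of git like this. This is a minimal illustration, not git's actual code: the function name, the plain string array standing in for the index, and the calling convention are all invented here. The point is only that (st_dev, st_ino) identifies a file, so two paths that resolve to the same pair have been collapsed by the filesystem:

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
#include <assert.h>

/* Given the path we were about to create ("lost") and the paths already
 * checked out, report which existing path the filesystem collapsed it
 * onto, by comparing device and inode numbers from lstat(). */
static const char *find_colliding_path(const char *lost,
				       const char **checked_out, int nr)
{
	struct stat lost_st, st;
	int i;

	if (lstat(lost, &lost_st) < 0)
		return NULL;	/* nothing on disk; not a collision */

	for (i = 0; i < nr; i++) {
		if (!strcmp(checked_out[i], lost))
			continue;	/* same spelling, not interesting */
		if (!lstat(checked_out[i], &st) &&
		    st.st_dev == lost_st.st_dev &&
		    st.st_ino == lost_st.st_ino)
			return checked_out[i];
	}
	return NULL;
}
```

As the thread notes, this costs one lstat() per already-checked-out entry, but only on the error path of a clone, which keeps the common case free.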
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On 8/2/2018 5:28 PM, Jeff King wrote:
> On Thu, Aug 02, 2018 at 02:14:30PM -0700, Junio C Hamano wrote:
> > Jeff King writes:
> >
> > > I also wonder if Windows could return some other file-unique
> > > identifier that would work in place of an inode here. That would be
> > > pretty easy to swap in via an #ifdef'd helper function. I'd be OK
> > > shipping without that and letting Windows folks fill it in later
> > > (as long as we do not do anything too stupid until then, like claim
> > > all of the inode==0 files are the same).
> >
> > Yeah, but such a useful file-unique identifier would probably be
> > used in place of inum in their (l)stat emulation already, if it
> > exists, no?
>
> Maybe. It might not work as ino_t. Or it might be expensive to get. Or
> maybe it's simply impossible. I don't know much about Windows. Some
> searching implies that NTFS does have a "file index" concept which is
> supposed to be unique.

This is hard and/or expensive on Windows. Yes, you can get the
"file index" values for an open file handle with a cost similar to
an fstat(). Unfortunately, the FindFirst/FindNext routines (equivalent
to the opendir/readdir routines) don't give you that data. So we'd
have to scan the directory and then open and stat each file. This is
terribly expensive on Windows -- and the reason we have the fscache
layer (in the GfW version) to intercept the lstat() calls whenever
possible.

It might be possible to use the NTFS Master File Table to discover this
(very big handwave), but I would need to do a little digging. This
would all be NTFS-specific. FAT and other volume types would not be
covered.

Another thing to keep in mind is that the collision could be because
of case folding (or other such nonsense) on a directory in the path.
I mean, if someone on Linux builds a commit containing:

    a/b/c/D/e/foo.txt
    a/b/c/d/e/foo.txt

we'll get a similar collision as if one of them were spelled "FOO.txt".

Also, do we need to worry about hard-links or symlinks here?

If checkout populates symlinks, then you might have another collision
opportunity. For example:

    a/b/c/D/e/foo.txt
    a/link -> ./b/c/d
    a/link/e/foo.txt

Also, some platforms (like the Mac) allow directory hard-links.
Granted, Git doesn't create hard-links during checkout, but the
user might.

I'm sure there are other edge cases here that make reporting
difficult; these are just a few I thought of. I guess what I'm
trying to say is that as a first step just report that you found
a collision -- without trying to identify the set of existing objects
that it collided with.

> At any rate, until we have an actual plan for Windows, I think it would
> make sense only to split the cases into "has working inodes" and
> "other", and make sure "other" does something sensible in the meantime
> (like mention the conflict, but skip trying to list duplicates).

Yes, this should be split. Do the "easy" Linux version first.
Keep in mind that there may also be a different solution for the Mac.

> When somebody wants to work on Windows support, then we can figure out
> if it just needs to wrap the "get unique identifier" operation, or if
> it would use a totally different algorithm.
>
> -Peff

Jeff
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On 2018-08-02 15:43, Duy Nguyen wrote:
> On Wed, Aug 1, 2018 at 11:20 PM Junio C Hamano wrote:
> >
> > Junio C Hamano writes:
> >
> > > Jeff King writes:
> > >
> > > > > Presumably we are already in an error codepath, so if it is
> > > > > absolutely necessary, then we can issue an lstat() to grab the
> > > > > inum for the path we are about to create, iterate over the
> > > > > previously checked out paths issuing lstat() and see which one
> > > > > yields the same inum, to find the one who is the culprit.
> > > >
> > > > Yes, this is the cleverness I was missing in my earlier response.
> > > >
> > > > So it seems do-able, and I like that this incurs no cost in the
> > > > non-error case.
> > >
> > > Not so fast, unfortunately.
> > >
> > > I suspect that some filesystems do not give us an inum that we can
> > > use for that "identity" purpose, and they tend to be the ones with
> > > the case-smashing characteristics where we need this code in the
> > > error path the most X-<.
> >
> > But even if the inum is unreliable, we should be able to use other
> > clues, perhaps the same set of fields we use for the cached stat
> > matching optimization in the "diff" plumbing commands, to implement
> > the error report. The more important part of the idea is that we
> > already need to notice that we need to remove a path that is in the
> > working tree while doing the checkout, so the alternative approach
> > won't incur any extra cost for normal cases where the project being
> > checked out does not have two files whose pathnames differ only in
> > case (or when checking out such an offending project to a
> > case-sensitive filesystem, of course).
> >
> > So I guess it still _is_ workable. Any takers?
>
> OK so we're going back to the original way of checking that we check
> out different files in the same place (because the fs is icase) and
> try to collect all paths for reporting, yes?

I would say: Yes.

> I can give it another go
> (but of course if anybody else steps up, I'd very gladly hand this
> over)

Not at the moment.
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 02, 2018 at 02:14:30PM -0700, Junio C Hamano wrote:
> Jeff King writes:
>
> > I also wonder if Windows could return some other file-unique identifier
> > that would work in place of an inode here. That would be pretty easy to
> > swap in via an #ifdef'd helper function. I'd be OK shipping without that
> > and letting Windows folks fill it in later (as long as we do not do
> > anything too stupid until then, like claim all of the inode==0 files are
> > the same).
>
> Yeah, but such a useful file-unique identifier would probably be
> used in place of inum in their (l)stat emulation already, if it
> exists, no?

Maybe. It might not work as ino_t. Or it might be expensive to get. Or
maybe it's simply impossible. I don't know much about Windows. Some
searching implies that NTFS does have a "file index" concept which is
supposed to be unique.

At any rate, until we have an actual plan for Windows, I think it would
make sense only to split the cases into "has working inodes" and
"other", and make sure "other" does something sensible in the meantime
(like mention the conflict, but skip trying to list duplicates).

When somebody wants to work on Windows support, then we can figure out
if it just needs to wrap the "get unique identifier" operation, or if
it would use a totally different algorithm.

-Peff
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Jeff King writes:

> I also wonder if Windows could return some other file-unique identifier
> that would work in place of an inode here. That would be pretty easy to
> swap in via an #ifdef'd helper function. I'd be OK shipping without that
> and letting Windows folks fill it in later (as long as we do not do
> anything too stupid until then, like claim all of the inode==0 files are
> the same).

Yeah, but such a useful file-unique identifier would probably be
used in place of inum in their (l)stat emulation already, if it
exists, no?

> PS It occurs to me that doing this naively (re-scan the entries already
>    checked out when we see a collision) ends up quadratic over the
>    number of entries in the worst case. That may not matter. You'd only
>    have a handful of collisions normally, and anybody malicious can
>    already git-bomb your checkout anyway. If we care, an alternative
>    would be to set a flag for "I saw some collisions", and then follow
>    up with a single pass putting entries into a hashmap of
>    inode->filename.

Yeah, that makes sense.
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 02, 2018 at 09:27:31AM -0700, Junio C Hamano wrote:

> > OK so we're going back to the original way of checking that we check
> > out different files in the same place (because the fs is icase) and
> > try to collect all paths for reporting, yes? I can give it another go
> > (but of course if anybody else steps up, I'd very gladly hand this
> > over)
>
> Detect and report, definitely yes; I am not sure about "collect all"
> (personally I am OK if we stopped at reporting "I tried to check out
> X but your project tree has something else that is turned into X by
> your pathname-smashing filesystem" without making it a requirement
> to report what the other one that conflicts with X is). Of course,
> reporting the other side _is_ nicer and I'd be happier if we can do
> so without too much ugly code, but I do not think it is a hard
> requirement.

Yeah, I think it would be OK to issue the warning for the conflicted
path, and then if we _can_ produce the secondary list of colliding
paths, do so. Even on a system with working inodes, we may not come up
with a match (e.g., if it wasn't us who wrote the file, but rather we
raced with some other process).

I also wonder if Windows could return some other file-unique identifier
that would work in place of an inode here. That would be pretty easy to
swap in via an #ifdef'd helper function. I'd be OK shipping without that
and letting Windows folks fill it in later (as long as we do not do
anything too stupid until then, like claim all of the inode==0 files are
the same).

-Peff

PS It occurs to me that doing this naively (re-scan the entries already
   checked out when we see a collision) ends up quadratic over the
   number of entries in the worst case. That may not matter. You'd only
   have a handful of collisions normally, and anybody malicious can
   already git-bomb your checkout anyway. If we care, an alternative
   would be to set a flag for "I saw some collisions", and then follow
   up with a single pass putting entries into a hashmap of
   inode->filename.
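The single-pass inode->filename hashmap from the PS could look roughly like the sketch below. To keep it self-contained it uses a tiny hand-rolled chained table rather than git's own hashmap.h, and all names and sizes here are invented for illustration:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <assert.h>

/* One bucket entry: a (device, inode) key and the path stored for it. */
struct ino_entry {
	dev_t dev;
	ino_t ino;
	const char *name;
	struct ino_entry *next;
};

#define INO_HASH_SIZE 1024

static unsigned int ino_hash(dev_t dev, ino_t ino)
{
	return (unsigned int)(ino ^ dev) % INO_HASH_SIZE;
}

/* Record name under (dev, ino); if another path was already stored for
 * the same inode, return that earlier path -- a collision pair. */
static const char *ino_map_add(struct ino_entry **map,
			       dev_t dev, ino_t ino, const char *name)
{
	unsigned int h = ino_hash(dev, ino);
	struct ino_entry *e;

	for (e = map[h]; e; e = e->next)
		if (e->dev == dev && e->ino == ino)
			return e->name;	/* same file, different spelling */

	e = malloc(sizeof(*e));
	e->dev = dev;
	e->ino = ino;
	e->name = name;
	e->next = map[h];
	map[h] = e;
	return NULL;
}
```

The follow-up pass would lstat() each checked-out entry and feed (st_dev, st_ino, path) into ino_map_add(); every non-NULL return is a colliding pair to report, turning the worst case from quadratic into a single linear scan.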
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Duy Nguyen writes:

> > But even if the inum is unreliable, we should be able to use other
> > clues, perhaps the same set of fields we use for the cached stat
> > matching optimization in the "diff" plumbing commands, to implement
> > the error report. The more important part of the idea is that we
> > already need to notice that we need to remove a path that is in the
> > working tree while doing the checkout, so the alternative approach
> > won't incur any extra cost for normal cases where the project being
> > checked out does not have two files whose pathnames differ only in
> > case (or when checking out such an offending project to a
> > case-sensitive filesystem, of course).
> >
> > So I guess it still _is_ workable. Any takers?
>
> OK so we're going back to the original way of checking that we check
> out different files in the same place (because the fs is icase) and
> try to collect all paths for reporting, yes? I can give it another go
> (but of course if anybody else steps up, I'd very gladly hand this
> over)

Detect and report, definitely yes; I am not sure about "collect all"
(personally I am OK if we stopped at reporting "I tried to check out
X but your project tree has something else that is turned into X by
your pathname-smashing filesystem" without making it a requirement
to report what the other one that conflicts with X is). Of course,
reporting the other side _is_ nicer and I'd be happier if we can do
so without too much ugly code, but I do not think it is a hard
requirement.

Thanks.
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Wed, Aug 1, 2018 at 11:20 PM Junio C Hamano wrote:
>
> Junio C Hamano writes:
>
> > Jeff King writes:
> >
> > >> Presumably we are already in an error codepath, so if it is
> > >> absolutely necessary, then we can issue an lstat() to grab the inum
> > >> for the path we are about to create, iterate over the previously
> > >> checked out paths issuing lstat() and see which one yields the same
> > >> inum, to find the one who is the culprit.
> > >
> > > Yes, this is the cleverness I was missing in my earlier response.
> > >
> > > So it seems do-able, and I like that this incurs no cost in the
> > > non-error case.
> >
> > Not so fast, unfortunately.
> >
> > I suspect that some filesystems do not give us an inum that we can use
> > for that "identity" purpose, and they tend to be the ones with the
> > case-smashing characteristics where we need this code in the error
> > path the most X-<.
>
> But even if the inum is unreliable, we should be able to use other
> clues, perhaps the same set of fields we use for the cached stat
> matching optimization in the "diff" plumbing commands, to implement
> the error report. The more important part of the idea is that we
> already need to notice that we need to remove a path that is in the
> working tree while doing the checkout, so the alternative approach
> won't incur any extra cost for normal cases where the project being
> checked out does not have two files whose pathnames differ only in
> case (or when checking out such an offending project to a
> case-sensitive filesystem, of course).
>
> So I guess it still _is_ workable. Any takers?

OK so we're going back to the original way of checking that we check
out different files in the same place (because the fs is icase) and
try to collect all paths for reporting, yes? I can give it another go
(but of course if anybody else steps up, I'd very gladly hand this
over)
--
Duy
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Junio C Hamano writes:

> Jeff King writes:
>
> >> Presumably we are already in an error codepath, so if it is
> >> absolutely necessary, then we can issue an lstat() to grab the inum
> >> for the path we are about to create, iterate over the previously
> >> checked out paths issuing lstat() and see which one yields the same
> >> inum, to find the one who is the culprit.
> >
> > Yes, this is the cleverness I was missing in my earlier response.
> >
> > So it seems do-able, and I like that this incurs no cost in the
> > non-error case.
>
> Not so fast, unfortunately.
>
> I suspect that some filesystems do not give us an inum that we can use
> for that "identity" purpose, and they tend to be the ones with the
> case-smashing characteristics where we need this code in the error
> path the most X-<.

But even if the inum is unreliable, we should be able to use other
clues, perhaps the same set of fields we use for the cached stat
matching optimization in the "diff" plumbing commands, to implement
the error report. The more important part of the idea is that we
already need to notice that we need to remove a path that is in the
working tree while doing the checkout, so the alternative approach
won't incur any extra cost for normal cases where the project being
checked out does not have two files whose pathnames differ only in
case (or when checking out such an offending project to a
case-sensitive filesystem, of course).

So I guess it still _is_ workable. Any takers?
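The "other clues" fallback for filesystems with untrustworthy inums could be as simple as comparing a few stat fields, in the spirit of git's cached-stat check (which looks at more fields than this). A minimal sketch; the helper name and the exact field choice here are illustrative, not git's actual code:

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <assert.h>

/* Weak identity test for when st_ino cannot be trusted: two lstat()
 * results that agree on all of these fields are *probably* the same
 * file. This can produce false positives, which is acceptable in an
 * error-reporting path that only suggests candidates. */
static int same_file_clues(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev &&
	       a->st_size == b->st_size &&
	       a->st_mode == b->st_mode &&
	       a->st_mtime == b->st_mtime;
}
```

Since this is only a heuristic, any match found this way should be reported as "may conflict with" rather than as a definite pairing.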
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Tue, Jul 31, 2018 at 8:23 PM Torsten Bögershausen wrote:
> I wonder if we can tell the users more about the "problems"
> and how to avoid them, or to live with them.
>
> This is more loud thinking:
>
>   "The following paths only differ in case\n"
>   "On a case-insensitive file system only one at a time can be present\n"
>   "You may rename one like this:\n"
>   "git checkout && git mv .1\n"

Jeff gave a couple more options [1] to fix or work around this. I think
the problem is there is no single recommended way to deal with it. If
there is, we can describe it in this warning. But listing multiple
options in this warning may be too much (the wall of text could easily
take half a screen). Or if people agree on _one_ suggestion, I will
gladly put it in.

[1] https://public-inbox.org/git/cacsjy8a_uzm7numyernhjmya0eyrqytv7dp2iklznxnboqu...@mail.gmail.com/T/#m60fedd7dc928a4d52eb5919811f84556f391a7b3

> > +			fprintf(stderr, "\t%s\n", dup.items[i].string);
>
> Another question:
> Do we need any quote_path() here?
> (This may be overkill, since typically the repos with conflicting names
> only use ASCII.)

Would be good to show trailing spaces in path names, so yes.
--
Duy
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Tue, Jul 31, 2018 at 8:44 PM Elijah Newren wrote:
> Is it worth attempting to also warn about paths that only differ in
> UTF-normalization on relevant MacOS systems?

Down this thread, Jeff Hostetler drew a scarier picture of "case"
handling on MacOS and Windows. I think we should start with supporting
things like utf normalization... in core.ignore first before doing the
warning stuff. At that point we know well how to detect and warn, or
any other pitfalls.
--
Duy
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Tue, Jul 31, 2018 at 9:13 PM Junio C Hamano wrote:
>
> Nguyễn Thái Ngọc Duy writes:
>
> > Another thing we probably should do is catch this in "git checkout"
> > too, not just "git clone", since your linux/unix colleague may
> > accidentally add some files that your mac/windows machine is not
> > very happy with.
>
> Then you would catch it not in checkout but in add, no?

No, because the person who added these may have done it on a
case-sensitive filesystem. Only later, when a colleague fetches the new
changes on a case-insensitive filesystem, does the problem become real.

If, in this scenario, core.ignore is enforced on all machines, then yes
we could catch it at "git add" (and we should already do that, or we
have a bug). At least in an open source project setting, I think
enforcing core.ignore will not work.
--
Duy
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Jeff King writes:

> > Presumably we are already in an error codepath, so if it is
> > absolutely necessary, then we can issue an lstat() to grab the inum
> > for the path we are about to create, iterate over the previously
> > checked out paths issuing lstat() and see which one yields the same
> > inum, to find the one who is the culprit.
>
> Yes, this is the cleverness I was missing in my earlier response.
>
> So it seems do-able, and I like that this incurs no cost in the
> non-error case.

Not so fast, unfortunately.

I suspect that some filesystems do not give us an inum that we can use
for that "identity" purpose, and they tend to be the ones with the
case-smashing characteristics where we need this code in the error
path the most X-<.
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Tue, Jul 31, 2018 at 01:12:14PM -0700, Junio C Hamano wrote:
> Jeff King writes:
>
> > On Tue, Jul 31, 2018 at 12:12:58PM -0700, Junio C Hamano wrote:
> > ...
> >> collapses two (or more) paths if we go that way. We only need to
> >> report "we tried to check out X but it seems your filesystem equates
> >> something else that is also in the project to X".
> >
> > Heh. See my similar suggestion in:
> >
> >   https://public-inbox.org/git/20180728095659.ga21...@sigill.intra.peff.net/
> >
> > and the response from Duy.
>
> Yes, but is there a reason why we need to report what that
> "something else" is?

I don't think it's strictly necessary, but it probably makes things
easier for the user. That said...

> Presumably we are already in an error codepath, so if it is
> absolutely necessary, then we can issue an lstat() to grab the inum
> for the path we are about to create, iterate over the previously
> checked out paths issuing lstat() and see which one yields the same
> inum, to find the one who is the culprit.

Yes, this is the cleverness I was missing in my earlier response.

So it seems do-able, and I like that this incurs no cost in the
non-error case.

-Peff
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Jeff King writes:

> On Tue, Jul 31, 2018 at 12:12:58PM -0700, Junio C Hamano wrote:
> ...
> > collapses two (or more) paths if we go that way. We only need to
> > report "we tried to check out X but it seems your filesystem equates
> > something else that is also in the project to X".
>
> Heh. See my similar suggestion in:
>
>   https://public-inbox.org/git/20180728095659.ga21...@sigill.intra.peff.net/
>
> and the response from Duy.

Yes, but is there a reason why we need to report what that
"something else" is?

Presumably we are already in an error codepath, so if it is
absolutely necessary, then we can issue an lstat() to grab the inum
for the path we are about to create, iterate over the previously
checked out paths issuing lstat() and see which one yields the same
inum, to find the one who is the culprit.
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Tue, Jul 31, 2018 at 12:12:58PM -0700, Junio C Hamano wrote:
> Elijah Newren writes:
>
> > Is it worth attempting to also warn about paths that only differ in
> > UTF-normalization on relevant MacOS systems?
>
> I hate to bring up a totally different approach this late in the
> party, but I wonder if it makes more sense to take advantage of
> "clone" being a command that starts from an empty working tree.
>
> builtin/clone.c::checkout() drives a single-tree unpack_trees(),
> using oneway_merge() as its callback, and at the end, eventually
> unpack_trees.c:check_updates() will call into checkout_entry()
> to perform the usual "unlink and then create" dance.
>
> I wonder if it makes sense to introduce a new option to tell the
> machinery to report when the final checkout_entry() notices that it
> needed to remove the working tree file to make room (perhaps that
> bit would go in "struct unpack_trees_options"). In the initial
> checkout codepath for a freshly cloned repository, that would only
> happen when your tree has two (or more) paths that get smashed
> together by a case-insensitive or UTF-normalizing filesystem, and the
> code we would maintain does not have to care how exactly the
> filesystem collapses two (or more) paths if we go that way. We only
> need to report "we tried to check out X but it seems your filesystem
> equates something else that is also in the project to X".

Heh. See my similar suggestion in:

  https://public-inbox.org/git/20180728095659.ga21...@sigill.intra.peff.net/

and the response from Duy.

-Peff
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Elijah Newren writes:

> Is it worth attempting to also warn about paths that only differ in
> UTF-normalization on relevant MacOS systems?

I hate to bring up a totally different approach this late in the
party, but I wonder if it makes more sense to take advantage of
"clone" being a command that starts from an empty working tree.

builtin/clone.c::checkout() drives a single-tree unpack_trees(),
using oneway_merge() as its callback, and at the end, eventually
unpack_trees.c:check_updates() will call into checkout_entry()
to perform the usual "unlink and then create" dance.

I wonder if it makes sense to introduce a new option to tell the
machinery to report when the final checkout_entry() notices that it
needed to remove the working tree file to make room (perhaps that
bit would go in "struct unpack_trees_options"). In the initial
checkout codepath for a freshly cloned repository, that would only
happen when your tree has two (or more) paths that get smashed
together by a case-insensitive or UTF-normalizing filesystem, and the
code we would maintain does not have to care how exactly the
filesystem collapses two (or more) paths if we go that way. We only
need to report "we tried to check out X but it seems your filesystem
equates something else that is also in the project to X".
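The option-bit idea can be modeled in isolation: a fresh clone starts from an empty tree, so if checkout_entry() finds anything already sitting at the target path, the filesystem must have collapsed some earlier path onto it. Everything below (struct, field, and function names) is invented to illustrate the shape of the proposal; it is not actual unpack-trees code:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <assert.h>

/* Stand-in for the proposed bit in "struct unpack_trees_options". */
struct fake_unpack_opts {
	int report_collisions;	/* set only for initial checkout of a clone */
	int saw_collision;	/* set when a collision was detected */
};

/* Model of checkout_entry(): before creating the file, notice whether
 * something is already there -- which, in an initially empty tree, can
 * only mean a path collision (or third-party interference). */
static int fake_checkout_entry(const char *path, const char *content,
			       struct fake_unpack_opts *o)
{
	struct stat st;
	FILE *f;

	if (o->report_collisions && !lstat(path, &st)) {
		o->saw_collision = 1;
		fprintf(stderr, "collision while checking out '%s'\n", path);
	}
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(content, f);
	fclose(f);
	return 0;
}
```

The appeal of this shape, as noted above, is that it never needs to know *how* the filesystem folds paths (case, Unicode normalization, trailing dots); it only observes the effect.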
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Nguyễn Thái Ngọc Duy writes:

> Another thing we probably should do is catch this in "git checkout"
> too, not just "git clone", since your linux/unix colleague may
> accidentally add some files that your mac/windows machine is not
> very happy with.

Then you would catch it not in checkout but in add, no?
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Mon, Jul 30, 2018 at 8:27 AM, Nguyễn Thái Ngọc Duy wrote: > Paths that only differ in case work fine in a case-sensitive > filesystems, but if those repos are cloned in a case-insensitive one, > you'll get problems. The first thing to notice is "git status" will > never be clean with no indication what's exactly is "dirty". "what" rather than "what's"? > This patch helps the situation a bit by pointing out the problem at > clone time. I have not suggested any way to work around or fix this > problem. But I guess we could probably have a section in > Documentation/ dedicated to this problem and point there instead of > a long advice in this warning. > > Another thing we probably should do is catch in "git checkout" too, > not just "git clone" since your linux/unix colleage colleague may drop "colleage", keep "colleague"? > accidentally add some files that your mac/windows machine is not very > happy with. But then there's another problem, once the problem is > known, we probably should stop spamming this warning at every > checkout, but how? Good questions. I have no answers. > > Signed-off-by: Nguyễn Thái Ngọc Duy > --- > builtin/clone.c | 41 + > 1 file changed, 41 insertions(+) > > diff --git a/builtin/clone.c b/builtin/clone.c > index 5c439f1394..32738c2737 100644 > --- a/builtin/clone.c > +++ b/builtin/clone.c > @@ -711,6 +711,33 @@ static void update_head(const struct ref *our, const > struct ref *remote, > } > } > > +static void find_duplicate_icase_entries(struct index_state *istate, > +struct string_list *dup) > +{ > + struct string_list list = STRING_LIST_INIT_NODUP; > + int i; > + > + for (i = 0; i < istate->cache_nr; i++) > + string_list_append(, istate->cache[i]->name); > + > + list.cmp = fspathcmp; > + string_list_sort(); So, you sort the list by fspathcmp to get the entries that differ in case only next to each other. Makes sense... 
> +
> +	for (i = 1; i < list.nr; i++) {
> +		const char *cur = list.items[i].string;
> +		const char *prev = list.items[i - 1].string;
> +
> +		if (dup->nr &&
> +		    !fspathcmp(cur, dup->items[dup->nr - 1].string)) {
> +			string_list_append(dup, cur);

If we have at least one duplicate in dup (and currently we'd have to
have at least two), and we now hit yet another (i.e. a third or
fourth...) way of spelling the same path, then we add it.

> +		} else if (!fspathcmp(cur, prev)) {
> +			string_list_append(dup, prev);
> +			string_list_append(dup, cur);
> +		}

...otherwise, if we find a duplicate, we add both spellings to the dup
list.

> +	}
> +	string_list_clear(&list, 0);
> +}
> +
>  static int checkout(int submodule_progress)
>  {
>  	struct object_id oid;
> @@ -761,6 +788,20 @@ static int checkout(int submodule_progress)
>  	if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK))
>  		die(_("unable to write new index file"));
>
> +	if (ignore_case) {
> +		struct string_list dup = STRING_LIST_INIT_DUP;
> +		int i;
> +
> +		find_duplicate_icase_entries(&the_index, &dup);
> +		if (dup.nr) {
> +			warning(_("the following paths in this repository only differ in case and will\n"

Perhaps I'm being excessively pedantic, but what if there are multiple
pairs of paths differing in case? E.g. if someone has readme.txt,
README.txt, foo.txt, and FOO.txt, you'll list all four files but
readme.txt and foo.txt do not only differ in case. Maybe something
like "...only differ in case from another path and will..." or is that
too verbose and annoying?

> +				  "cause problems because you have cloned it on an case-insensitive filesytem:\n"));
> +			for (i = 0; i < dup.nr; i++)
> +				fprintf(stderr, "\t%s\n", dup.items[i].string);
> +		}
> +		string_list_clear(&dup, 0);
> +	}
> +
>  	err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
>  			   oid_to_hex(&oid), "1", NULL);

Is it worth attempting to also warn about paths that only differ in
UTF-normalization on relevant MacOS systems?
Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
On Mon, Jul 30, 2018 at 05:27:55PM +0200, Nguyễn Thái Ngọc Duy wrote:

> Paths that only differ in case work fine in a case-sensitive
> filesystems, but if those repos are cloned in a case-insensitive one,
> you'll get problems. The first thing to notice is "git status" will
> never be clean with no indication what's exactly is "dirty".
>
> This patch helps the situation a bit by pointing out the problem at
> clone time. I have not suggested any way to work around or fix this
> problem. But I guess we could probably have a section in
> Documentation/ dedicated to this problem and point there instead of
> a long advice in this warning.
>
> Another thing we probably should do is catch in "git checkout" too,
> not just "git clone" since your linux/unix colleage may
> accidentally add some files that your mac/windows machine is not very
> happy with. But then there's another problem, once the problem is
> known, we probably should stop spamming this warning at every
> checkout, but how?
> Signed-off-by: Nguyễn Thái Ngọc Duy
> ---
>  builtin/clone.c | 41 +
>  1 file changed, 41 insertions(+)
>
> diff --git a/builtin/clone.c b/builtin/clone.c
> index 5c439f1394..32738c2737 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -711,6 +711,33 @@ static void update_head(const struct ref *our, const struct ref *remote,
>  	}
>  }
>
> +static void find_duplicate_icase_entries(struct index_state *istate,
> +					 struct string_list *dup)
> +{
> +	struct string_list list = STRING_LIST_INIT_NODUP;
> +	int i;
> +
> +	for (i = 0; i < istate->cache_nr; i++)
> +		string_list_append(&list, istate->cache[i]->name);
> +
> +	list.cmp = fspathcmp;
> +	string_list_sort(&list);
> +
> +	for (i = 1; i < list.nr; i++) {
> +		const char *cur = list.items[i].string;
> +		const char *prev = list.items[i - 1].string;
> +
> +		if (dup->nr &&
> +		    !fspathcmp(cur, dup->items[dup->nr - 1].string)) {
> +			string_list_append(dup, cur);
> +		} else if (!fspathcmp(cur, prev)) {
> +			string_list_append(dup, prev);
> +			string_list_append(dup, cur);
> +		}
> +	}
> +	string_list_clear(&list, 0);
> +}
> +
>  static int checkout(int submodule_progress)
>  {
>  	struct object_id oid;
> @@ -761,6 +788,20 @@ static int checkout(int submodule_progress)
>  	if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK))
>  		die(_("unable to write new index file"));
>
> +	if (ignore_case) {
> +		struct string_list dup = STRING_LIST_INIT_DUP;
> +		int i;
> +
> +		find_duplicate_icase_entries(&the_index, &dup);
> +		if (dup.nr) {
> +			warning(_("the following paths in this repository only differ in case and will\n"
> +				  "cause problems because you have cloned it on an case-insensitive filesytem:\n"));

Thanks for the patch. I wonder if we can tell the users more about the
"problems" and how to avoid them, or to live with them.
This is more loud thinking:

  "The following paths only differ in case\n"
  "On a case-insensitive file system only one at a time can be present\n"
  "You may rename one like this:\n"
  "git checkout <path> && git mv <path> <path>.1\n"

> +			fprintf(stderr, "\t%s\n", dup.items[i].string);

Another question: Do we need any quote_path() here? (This may be
overkill, since typically the repos with conflicting names only use
ASCII.)

> +		}
> +		string_list_clear(&dup, 0);
> +	}
> +
>  	err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
>  			   oid_to_hex(&oid), "1", NULL);
>
> --
> 2.18.0.656.gda699b98b3
>
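In the spirit of the advice text drafted above, a user can list the colliding groups themselves with the same sort-and-compare-neighbors idea the patch uses. This is only a sketch: the path list here is hard-coded for illustration (in a real clone it would come from `git ls-files`), and `-D`/`-i` on uniq are GNU coreutils extensions.

```shell
# Fold case while sorting so colliding spellings become adjacent, then
# print every member of each case-insensitive duplicate group.  The
# hard-coded list stands in for `git ls-files` output.
printf '%s\n' README.txt readme.txt foo.txt Makefile FOO.txt |
sort -f |   # case-folded sort puts case-only variants next to each other
uniq -Di    # -D: print all duplicated lines, -i: compare ignoring case
```

On a real clone the first command would be `git ls-files`; any path this pipeline prints is one that a checkout on a case-insensitive filesystem cannot fully materialize.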
[PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems
Paths that only differ in case work fine in a case-sensitive
filesystems, but if those repos are cloned in a case-insensitive one,
you'll get problems. The first thing to notice is "git status" will
never be clean with no indication what's exactly is "dirty".

This patch helps the situation a bit by pointing out the problem at
clone time. I have not suggested any way to work around or fix this
problem. But I guess we could probably have a section in
Documentation/ dedicated to this problem and point there instead of
a long advice in this warning.

Another thing we probably should do is catch in "git checkout" too,
not just "git clone" since your linux/unix colleage may
accidentally add some files that your mac/windows machine is not very
happy with. But then there's another problem, once the problem is
known, we probably should stop spamming this warning at every
checkout, but how?

Signed-off-by: Nguyễn Thái Ngọc Duy
---
 builtin/clone.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index 5c439f1394..32738c2737 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -711,6 +711,33 @@ static void update_head(const struct ref *our, const struct ref *remote,
 	}
 }
 
+static void find_duplicate_icase_entries(struct index_state *istate,
+					 struct string_list *dup)
+{
+	struct string_list list = STRING_LIST_INIT_NODUP;
+	int i;
+
+	for (i = 0; i < istate->cache_nr; i++)
+		string_list_append(&list, istate->cache[i]->name);
+
+	list.cmp = fspathcmp;
+	string_list_sort(&list);
+
+	for (i = 1; i < list.nr; i++) {
+		const char *cur = list.items[i].string;
+		const char *prev = list.items[i - 1].string;
+
+		if (dup->nr &&
+		    !fspathcmp(cur, dup->items[dup->nr - 1].string)) {
+			string_list_append(dup, cur);
+		} else if (!fspathcmp(cur, prev)) {
+			string_list_append(dup, prev);
+			string_list_append(dup, cur);
+		}
+	}
+	string_list_clear(&list, 0);
+}
+
 static int checkout(int submodule_progress)
 {
 	struct object_id oid;
@@ -761,6 +788,20 @@ static int checkout(int submodule_progress)
 	if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK))
 		die(_("unable to write new index file"));
 
+	if (ignore_case) {
+		struct string_list dup = STRING_LIST_INIT_DUP;
+		int i;
+
+		find_duplicate_icase_entries(&the_index, &dup);
+		if (dup.nr) {
+			warning(_("the following paths in this repository only differ in case and will\n"
+				  "cause problems because you have cloned it on an case-insensitive filesytem:\n"));
+			for (i = 0; i < dup.nr; i++)
+				fprintf(stderr, "\t%s\n", dup.items[i].string);
+		}
+		string_list_clear(&dup, 0);
+	}
+
 	err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
 			   oid_to_hex(&oid), "1", NULL);
-- 
2.18.0.656.gda699b98b3
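The situation this patch detects can be reproduced without a case-insensitive filesystem at hand, because the index itself happily stores both spellings on any platform; only checkout collides. The repository layout, blob content, and path names below are all made up for the demo.

```shell
# Build a throwaway repo whose index holds two paths differing only in
# case, using plumbing so no worktree files are needed.
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
blob=$(echo hello | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644,"$blob",readme.txt
git update-index --add --cacheinfo 100644,"$blob",README.txt
# Both spellings are now index entries, which is what
# find_duplicate_icase_entries() would scan and report at clone time:
git ls-files
```

A checkout of this index on a case-insensitive filesystem can only materialize one of the two entries, so `git status` never comes back clean, as the commit message describes.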