Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Wed, Aug 29, 2018 at 02:09:13PM -0700, Jonathan Nieder wrote: > Jeff King wrote: > > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote: > > >> Yeah, then let's just convert '/' with as little overhead as possible. > > > > Do you care about case-folding issues (e.g., submodules "FOO" and "foo" > > colliding)? > > > > I'm OK if the answer is "no", but if you do want to deal with it, the > > time is probably now. > > Have we rejected the config approach? I really liked the attribute of > not having to solve everything right away. I'm getting scared that > we've forgotten that goal. I personally have no problem with that approach, but I also haven't thought that hard about it (I was mostly ignoring the discussion since it seemed like submodule-interested folks, but I happened to see what looked like a potentially bad idea cc'd to me ;) ). -Peff
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Wed, Aug 29, 2018 at 02:10:37PM -0700, Stefan Beller wrote: > > Yes, that makes even the capitalized "CON" issues go away. It's not a > > one-to-one mapping, though ("foo-" and "foo_" map to the same entity). > > foo_ would map to foo__, and foo- would map to something else. > (foo- as we do not rewrite dashes, yet?) Ah, OK, I took your: > [A-Z] -> _[a-z] to mean "A-Z becomes a-z, and everything else becomes underscore". If you mean a real one-to-one mapping that allows a-z and only a few safe metacharacters, then yeah, that's what I was thinking, too. > > If we want that, too, I think something like url-encoding is fine, with > > the caveat that we simply urlencode _more_ things (i.e., anything not in > > [a-z_]). > > Yeah I think we need more than url encoding now. If you take "url encoding" to only be the mechanical transformation of quoting, not the set of _what_ gets quoting, we can still stick with it. We don't need to, but it's probably no worse than inventing our own set of quoting rules. -Peff
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Wed, Aug 29, 2018 at 2:18 PM Jonathan Nieder wrote: > > Hi, > > Stefan Beller wrote: > > >> Yes, that makes even the capitalized "CON" issues go away. It's not a > >> one-to-one mapping, though ("foo-" and "foo_" map to the same entity). > > > > foo_ would map to foo__, and foo- would map to something else. > > (foo- as we do not rewrite dashes, yet?) > > > >> If we want that, too, I think something like url-encoding is fine, with > >> the caveat that we simply urlencode _more_ things (i.e., anything not in > >> [a-z_]). > > > > Yeah I think we need more than url encoding now. > > Can you say more? https://public-inbox.org/git/CAGZ79kZv4BjRq=kq_1UeT2Kn38OZwYFgnMsTe6X_WP41=hb...@mail.gmail.com/ > Can you spell out for me what problem we're solving with something more > custom? case sensitivity for example. > the ability to tweak things later. That is unrelated to the choice of encoding, but more related to https://public-inbox.org/git/20180816181940.46114-1-bmw...@google.com/
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On 08/29, Stefan Beller wrote: > On Wed, Aug 29, 2018 at 2:09 PM Jonathan Nieder wrote: > > > > Jeff King wrote: > > > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote: > > > > >> Yeah, then let's just convert '/' with as little overhead as possible. > > > > > > Do you care about case-folding issues (e.g., submodules "FOO" and "foo" > > > colliding)? > > > > > > I'm OK if the answer is "no", but if you do want to deal with it, the > > > time is probably now. > > > > Have we rejected the config approach? > > I did not reject that approach, but am rather waiting for patches. ;-) Note I did send out a patch using this approach, so no need to wait any longer! :D https://public-inbox.org/git/20180816181940.46114-1-bmw...@google.com/ -- Brandon Williams
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
Hi, Stefan Beller wrote: >> Yes, that makes even the capitalized "CON" issues go away. It's not a >> one-to-one mapping, though ("foo-" and "foo_" map to the same entity). > > foo_ would map to foo__, and foo- would map to something else. > (foo- as we do not rewrite dashes, yet?) > >> If we want that, too, I think something like url-encoding is fine, with >> the caveat that we simply urlencode _more_ things (i.e., anything not in >> [a-z_]). > > Yeah I think we need more than url encoding now. Can you say more? Perhaps my expectations have been poisoned by tools like dpkg-buildpackage that use urlencode. As far as I can tell, it works fine. Moreover, urlencode has some attributes that make it a good potential fit: it's intuitive, it's unambiguous (yes, it's one-to-many, but at least it's not many-to-many), and people know how to deal with it from their lives using browsers. Can you spell out for me what problem we're solving with something more custom? Stepping back, I am very worried about any design that doesn't give us the ability to tweak things later. See [1] and [2] for more on that subject. Thanks, Jonathan [1] https://public-inbox.org/git/20180816023446.ga127...@aiede.svl.corp.google.com/ [2] https://public-inbox.org/git/20180829210913.gf7...@aiede.svl.corp.google.com/
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Wed, Aug 29, 2018 at 2:09 PM Jonathan Nieder wrote: > > Jeff King wrote: > > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote: > > >> Yeah, then let's just convert '/' with as little overhead as possible. > > > > Do you care about case-folding issues (e.g., submodules "FOO" and "foo" > > colliding)? > > > > I'm OK if the answer is "no", but if you do want to deal with it, the > > time is probably now. > > Have we rejected the config approach? I did not reject that approach, but am rather waiting for patches. ;-) > I really liked the attribute of > not having to solve everything right away. I'm getting scared that > we've forgotten that goal. Eh, sorry for side tracking this issue. I am just under the impression that the URL encoding is not particularly good for our use case as it solves just one out of many things, whereas the one thing (having no slashes) can also be solved in an easier way. > It mixes well with Stefan's idea of setting up a new .git/submodules/ > directory. We could require that everything in .git/submodules/ have > configuration (or that everything in that directory either have > configuration or be the result of a "very simple" transformation) and > that way, all ambiguity goes away. I would not want to have a world where we require that config, but I would agree to the latter, hence we would need to discuss "very simple". I guess that are 2 or 3 rules at most. > Part of the definition of "very simple" could be that the submodule > name must consist of some whitelisted list of characters (including no > uppercase), for example. Good catch! Thanks for chiming in, Stefan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
> Yes, that makes even the capitalized "CON" issues go away. It's not a > one-to-one mapping, though ("foo-" and "foo_" map to the same entity). foo_ would map to foo__, and foo- would map to something else. (foo- as we do not rewrite dashes, yet?) > > If we want that, too, I think something like url-encoding is fine, with > the caveat that we simply urlencode _more_ things (i.e., anything not in > [a-z_]). Yeah I think we need more than url encoding now.
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
Jeff King wrote: > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote: >> Yeah, then let's just convert '/' with as little overhead as possible. > > Do you care about case-folding issues (e.g., submodules "FOO" and "foo" > colliding)? > > I'm OK if the answer is "no", but if you do want to deal with it, the > time is probably now. Have we rejected the config approach? I really liked the attribute of not having to solve everything right away. I'm getting scared that we've forgotten that goal. It mixes well with Stefan's idea of setting up a new .git/submodules/ directory. We could require that everything in .git/submodules/ have configuration (or that everything in that directory either have configuration or be the result of a "very simple" transformation) and that way, all ambiguity goes away. Part of the definition of "very simple" could be that the submodule name must consist of some whitelisted list of characters (including no uppercase), for example. Thanks, Jonathan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Wed, Aug 29, 2018 at 11:10:51AM -0700, Stefan Beller wrote: > > Do you care about case-folding issues (e.g., submodules "FOO" and "foo" > > colliding)? > > I do. :( > > 2d84f13dcb6 (config: fix case sensitive subsection names on writing, > 2018-08-08) > explains the latest episode of case folding with submodules involved. > > > I'm OK if the answer is "no", but if you do want to deal with it, the > > time is probably now. > > Good point. But as soon as we start discussing case sensitivity, we > are drawn down the rabbit hole of funny file names. (Try naming > a submodule "CON1" and obtain it on Windows for example) > So we would need to have a file system specific encoding function for > submodule names, which sounds like a maintenance night mare. Hmph. I'd hoped that simply escaping metacharacters and doing some obvious case-folding would be enough. And I think that would cover most accidental cases. But yeah, Windows reserved names are basically indistinguishable from reasonable names. They'd probably need special-cased. OTOH, I'm not sure how we handle those for entries in the actual tree. Poking around git-for-windows/git, I think it uses the magic "\\?" marker to tell the OS to interpret the name literally. So I wonder if it might be sufficient to just deal with the more obvious folding issues. Or as you noted, if we just choose lowercase names as the normalized form, that might also be enough. :) > So if I was thinking in the scheme presented above, we could just > have another rule that is > > [A-Z] -> _[a-z] > > (lowercase capital letters and escape them with an underscore) Yes, that makes even the capitalized "CON" issues go away. It's not a one-to-one mapping, though ("foo-" and "foo_" map to the same entity). If we want that, too, I think something like url-encoding is fine, with the caveat that we simply urlencode _more_ things (i.e., anything not in [a-z_]). -Peff
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Tue, Aug 28, 2018 at 10:25 PM Jeff King wrote: > > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote: > > > 3) (optional) instead of putting it all in modules/, use another > >directory gitmodules/ for example. this will make sure we can tell > >if a repository has been converted or is stuck with a setup of a > >current git. > > I actually kind of like that idea, as it makes the interaction between > old and new names much simpler to reason about. > > And since old code won't know about the new names anyway, there's in > theory no downside. In practice, of course, the encoding may often be a > noop, and lazy scripts would continue to work most of the time if you > didn't change out the prefix directory. I'm not sure if that is an > argument for the scheme (because it will suss out broken scripts more > consistently) or against it (because 99% of the time those old scripts > would just happen to work). > > > > This is exactly the reason why I wanted to get some opinions on what the > > > best thing to do here would be. I _think_ the best thing would probably > > > be to write a specific routine to do the conversion, and it wouldn't > > > even have to be all that complex. Basically I'm just interested in > > > converting '/' characters so that things no longer behave like > > > nested directories. > > > > Yeah, then let's just convert '/' with as little overhead as possible. > > Do you care about case-folding issues (e.g., submodules "FOO" and "foo" > colliding)? I do. :( 2d84f13dcb6 (config: fix case sensitive subsection names on writing, 2018-08-08) explains the latest episode of case folding with submodules involved. > I'm OK if the answer is "no", but if you do want to deal with it, the > time is probably now. Good point. But as soon as we start discussing case sensitivity, we are drawn down the rabbit hole of funny file names. (Try naming a submodule "CON1" and obtain it on Windows for example) So we would need to have a file system specific encoding function for submodule names, which sounds like a maintenance night mare. The CON1 example shows that URL encoding may not be enough on Windows and we'd have to extend the encoding if we care about FS issues. Another example would be "a" and "a\b" which would be a mess in Windows as the '\' would work as a dir separator whereas these two names were ok on linux. This would be fixed with url encoding. URL encoding would not fix the case-folding issue that you mentioned above. So if I was thinking in the scheme presented above, we could just have another rule that is [A-Z] -> _[a-z] (lowercase capital letters and escape them with an underscore) But with that rule added, we are inventing a really complicated encoding scheme already.
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote: > 3) (optional) instead of putting it all in modules/, use another >directory gitmodules/ for example. this will make sure we can tell >if a repository has been converted or is stuck with a setup of a >current git. I actually kind of like that idea, as it makes the interaction between old and new names much simpler to reason about. And since old code won't know about the new names anyway, there's in theory no downside. In practice, of course, the encoding may often be a noop, and lazy scripts would continue to work most of the time if you didn't change out the prefix directory. I'm not sure if that is an argument for the scheme (because it will suss out broken scripts more consistently) or against it (because 99% of the time those old scripts would just happen to work). > > This is exactly the reason why I wanted to get some opinions on what the > > best thing to do here would be. I _think_ the best thing would probably > > be to write a specific routine to do the conversion, and it wouldn't > > even have to be all that complex. Basically I'm just interested in > > converting '/' characters so that things no longer behave like > > nested directories. > > Yeah, then let's just convert '/' with as little overhead as possible. Do you care about case-folding issues (e.g., submodules "FOO" and "foo" colliding)? I'm OK if the answer is "no", but if you do want to deal with it, the time is probably now. -Peff
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
> > > - echo "gitdir: > > > ../../../.git/modules/sub3/modules/dirdir/subsub" > > > >./sub3/dirdir/subsub/.git_expect > > > + echo "gitdir: > > > ../../../.git/modules/sub3/modules/dirdir%2fsubsub" > > > >./sub3/dirdir/subsub/.git_expect > > > > One interesting thing about url-encoding is that it's not one-to-one. > > This case could also be %2F, which is a different file (on a > > case-sensitive filesystem). I think "%20" and "+" are similarly > > interchangeable. > > > > If we were decoding the filenames, that's fine. The round-trip is > > lossless. > > > > But that's not quite how the new code behaves. We encode the input and > > then check to see if it matches an encoding we previously performed. So > > if our urlencode routines ever change, this will subtly break. And this is the problem: a) we have a 'complicated' encoding here, which must never change b) the "encode and check if it matches", will produce ugly code going forward, as it tries to differentiate between submodules named "url_encoded(a)" and "a" (e.g. "a%20b" and "a b" would conflict and we have to resolve the conflict, although those two names are perfectly fine as they do not have the original problem of having slashes) Hence I would propose a simpler encoding: 1)/ -> _ ( replace a slash by an underscore) 2)_ -> __ (replace any underscore by 2 underscores, this is just the escaping mechanism to differentiate a/b and a_b) 3) (optional) instead of putting it all in modules/, use another directory gitmodules/ for example. this will make sure we can tell if a repository has been converted or is stuck with a setup of a current git. > This is exactly the reason why I wanted to get some opinions on what the > best thing to do here would be. I _think_ the best thing would probably > be to write a specific routine to do the conversion, and it wouldn't > even have to be all that complex. Basically I'm just interested in > converting '/' characters so that things no longer behave like > nested directories. Yeah, then let's just convert '/' with as little overhead as possible. Thanks, Stefan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On 08/15, Jonathan Nieder wrote: > Stefan Beller wrote: > > Jonathan Nieder wrote: > > >> All at the cost of recording a little configuration somewhere. If we > >> want to decrease the configuration, we can avoid recording it there in > >> the easy cases (e.g. when name == gitdirname). That's "just" an > >> optimization. > > > > Sounds good, but gerrit for example would not take advantage of such > > optimisation as they have slashes in their submodules. :-( > > I wonder if we can optimize further and keep slashes if there is > > no conflict (as then name == gitdirname, so it can be optimized). > > One possibility would be to treat gsub("/", "%2f") as another of the > easy cases. > > [...] > >> And then we have the ability later to handle all the edge cases we > >> haven't handled yet today: > >> > >> - sharding when the number of submodules is too large > >> - case-insensitive filesystems > >> - path name length limits > >> - different sets of filesystem-special characters > >> > >> Sane? Seems like a sensible thing to do. Let me work up some patches to implement this using config primarily and these other schemes as fallbacks. > > > > I'll keep thinking about it. > > Thanks. > > > FYI: the reduction in configuration was just sent out. > > https://public-inbox.org/git/20180816023100.161626-1-sbel...@google.com/ > for those following along. > > Ciao, > Jonathan -- Brandon Williams
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
Jonathan Nieder writes: > So it would be nice, for future-proofing, if we can change the naming > scheme later. > ... > All at the cost of recording a little configuration somewhere. If we > want to decrease the configuration, we can avoid recording it there in > the easy cases (e.g. when name == gitdirname). That's "just" an > optimization. > > And then we have the ability later to handle all the edge cases we > haven't handled yet today: > > - sharding when the number of submodules is too large > - case-insensitive filesystems > - path name length limits > - different sets of filesystem-special characters > > Sane? Yup.
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
Stefan Beller wrote: > Jonathan Nieder wrote: >> All at the cost of recording a little configuration somewhere. If we >> want to decrease the configuration, we can avoid recording it there in >> the easy cases (e.g. when name == gitdirname). That's "just" an >> optimization. > > Sounds good, but gerrit for example would not take advantage of such > optimisation as they have slashes in their submodules. :-( > I wonder if we can optimize further and keep slashes if there is > no conflict (as then name == gitdirname, so it can be optimized). One possibility would be to treat gsub("/", "%2f") as another of the easy cases. [...] >> And then we have the ability later to handle all the edge cases we >> haven't handled yet today: >> >> - sharding when the number of submodules is too large >> - case-insensitive filesystems >> - path name length limits >> - different sets of filesystem-special characters >> >> Sane? > > I'll keep thinking about it. Thanks. > FYI: the reduction in configuration was just sent out. https://public-inbox.org/git/20180816023100.161626-1-sbel...@google.com/ for those following along. Ciao, Jonathan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
> [...] all good reasons; ship it :-) > All at the cost of recording a little configuration somewhere. If we > want to decrease the configuration, we can avoid recording it there in > the easy cases (e.g. when name == gitdirname). That's "just" an > optimization. Sounds good, but gerrit for example would not take advantage of such optimisation as they have slashes in their submodules. :-( I wonder if we can optimize further and keep slashes if there is no conflict (as then name == gitdirname, so it can be optimized). > And then we have the ability later to handle all the edge cases we > haven't handled yet today: > > - sharding when the number of submodules is too large > - case-insensitive filesystems > - path name length limits > - different sets of filesystem-special characters > > Sane? I'll keep thinking about it. FYI: the reduction in configuration was just sent out. Thanks, Stefan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
Hi again, Stefan Beller wrote: > On Tue, Aug 14, 2018 at 2:12 PM Jonathan Nieder wrote: >> What if we forbid directory separator characters in the gitdirname? > > Fine with me, but ideally we'd want to allow sharding the > submodules. When you have 1000 submodules > we'd want them not all inside the toplevel "modules/" ? That's a good reason to permit slashes in the gitdirname. If I understood the rest of your reply correctly, your worry was about dangerous gitdirname values in .gitmodules. I never had any wish to read them from there anyway, so this worry hopefully goes away. [...] >> In this proposal, it would only be read from config, not from >> .gitmodules. > > Ah good point. That makes sense. > > Stepping back a bit regarding the config: [...] > Now that we have the submodule.active or even > submodule..active flags, we do not need (b) any more. > So the URL turns into a useless piece of cruft that just is unneeded > and might confuse the user. > > So maybe I'd want to propose a patch that removes > submodule..url from the config once it is cloned. > (I just read up on "submodule sync" again, but that might not > even need special care for this new world) > > And with all that said, I think if we can avoid having the submodules > gitdir in the config, the config would look much cleaner, too. Yes, I understand and agree with this. I should further spell out my motivation with this gitdirname suggestion. The issue that some people have mentioned in this thread is that urlencoding might not be perfect --- it's pretty close to perfect, but it's likely we'll come up with some unanticipated needs later (like sharding) that it doesn't solve. Solving those all right now would not necessarily be wise, since the thing about unanticipated needs is that you never know in advance what they will be. ;-) So it would be nice, for future-proofing, if we can change the naming scheme later. As a bonus, that would also make interoperability with other implementations easier. For example, suppose we mess up in JGit and urlencode a different set of characters than Git does. Then a mixed Git + JGit installation would have this subtle bug of the submodule .git directory not being reused when I switch to and from and branch not containing that submodule, in some circumstances. That sounds difficult to support. Whereas if we have a gitdirname configuration variable, then JGit and libgit2 and go-git do not have to match the naming scheme Git chooses. They can try, but if one gets it subtly wrong then that is okay because the submodule's directory name is right there and easy to look up. All at the cost of recording a little configuration somewhere. If we want to decrease the configuration, we can avoid recording it there in the easy cases (e.g. when name == gitdirname). That's "just" an optimization. And then we have the ability later to handle all the edge cases we haven't handled yet today: - sharding when the number of submodules is too large - case-insensitive filesystems - path name length limits - different sets of filesystem-special characters Sane? Thanks, Jonathan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
At 15:33 -0700 08 Aug 2018, Brandon Williams wrote: Teach "submodule_name_to_gitdir()" to munge a submodule's name (by url encoding it) before using it to build a path to the submodule's gitdir. Seems like this will be a problem if it results in names that exceed NAME_MAX? On common systems that's 255, so it's probably not going to be common; but it certainly could for some repositories.
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Tue, Aug 14, 2018 at 2:12 PM Jonathan Nieder wrote: > > Hi, > > Stefan Beller wrote: > > On Tue, Aug 14, 2018 at 11:57 AM Jonathan Nieder wrote: > > >> Second, what if we store the pathname in config? We already store the > >> URL there: > >> > >> [submodule "plugins/hooks"] > >> url = https://gerrit.googlesource.com/plugins/hooks > >> > >> So we could (as a followup patch) do something like > >> > >> [submodule "plugins/hooks"] > >> url = https://gerrit.googlesource.com/plugins/hooks > >> gitdirname = plugins%2fhooks > >> > >> and use that for lookups instead of regenerating the directory name. > >> What do you think? > > > > As I just looked at worktree code, this sounds intriguing for the wrong > > reason (again), as a user may want to point the gitdirname to a repository > > that they have already on disk outside the actual superproject. They > > would be reinventing worktrees in the submodule space. ;-) > > > > This would open up the security hole that we just had, again. > > So we'd have to make sure that the gitdirname (instead of the > > now meaningless subsection name) is proof to ../ attacks. > > > > I feel uneasy about this as then the user might come in > > and move submodules and repoint the gitdirname... > > to a not url encoded path. Exposing this knob just > > asks for trouble, no? > > What if we forbid directory separator characters in the gitdirname? Fine with me, but ideally we'd want to allow sharding the submodules. When you have 1000 submodules we'd want them not all inside the toplevel "modules/" ? Up to now we could just wave hands and claim the user (who is clearly experienced with submodules as they use so many of them) would shard it properly. With this scheme we loose the ability to shard. > [...] > > What would happen if gitdirname is changed as part of > > history? (The same problem we have now with changing > > the subsection name) > > In this proposal, it would only be read from config, not from > .gitmodules. Ah good point. That makes sense. Stepping back a bit regarding the config: When I clone gerrit (or any repo using submodules) $ git clone --recurse-submodules \ https://gerrit.googlesource.com/gerrit g2 [...] $ cat g2/.git/config [submodule] active = . [submodule "plugins/codemirror-editor"] url = https://gerrit.googlesource.com/plugins/codemirror-editor [... more urls to follow...] Originally we have had the url in the config, (a) that we can change the URLs after the "git submodule init" and "git submodule update" step that actually clones the submodule if not present and much more importantly (b) to know which submodule "was initialized/active". Now that we have the submodule.active or even submodule..active flags, we do not need (b) any more. So the URL turns into a useless piece of cruft that just is unneeded and might confuse the user. So maybe I'd want to propose a patch that removes submodule..url from the config once it is cloned. (I just read up on "submodule sync" again, but that might not even need special care for this new world) And with all that said, I think if we can avoid having the submodules gitdir in the config, the config would look much cleaner, too. But maybe that is the wrong thing to optimize for. ;-) It just demonstrates that we'd have a submodule specific thing again in the config. So my preference would be to do a similar thing as url-encoding as that solves the issue of slashes and potentially of case sensitivity (e.g. encode upper case A as lower case with underscore _a) However the transition worries me, as it transitions within the same namespace. Back then when we transferred from the .git dir inside the submodules working tree to the embedded version in the superprojects .git dir, there was no overlap, and any potential directory in .git/modules/ that was already there, was highly unusual, so asking the user for help is the reasonable thing to do. But now we might run into issues that has overlap between old(name as is) and new (urlencoded) world. So maybe we also want to transition from modules/ to submodules/)> Thanks, Stefan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
Hi, Stefan Beller wrote: > On Tue, Aug 14, 2018 at 11:57 AM Jonathan Nieder wrote: >> Second, what if we store the pathname in config? We already store the >> URL there: >> >> [submodule "plugins/hooks"] >> url = https://gerrit.googlesource.com/plugins/hooks >> >> So we could (as a followup patch) do something like >> >> [submodule "plugins/hooks"] >> url = https://gerrit.googlesource.com/plugins/hooks >> gitdirname = plugins%2fhooks >> >> and use that for lookups instead of regenerating the directory name. >> What do you think? > > As I just looked at worktree code, this sounds intriguing for the wrong > reason (again), as a user may want to point the gitdirname to a repository > that they have already on disk outside the actual superproject. They > would be reinventing worktrees in the submodule space. ;-) > > This would open up the security hole that we just had, again. > So we'd have to make sure that the gitdirname (instead of the > now meaningless subsection name) is proof to ../ attacks. > > I feel uneasy about this as then the user might come in > and move submodules and repoint the gitdirname... > to a not url encoded path. Exposing this knob just > asks for trouble, no? What if we forbid directory separator characters in the gitdirname? [...] > What would happen if gitdirname is changed as part of > history? (The same problem we have now with changing > the subsection name) In this proposal, it would only be read from config, not from .gitmodules. Thanks, Jonathan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Tue, Aug 14, 2018 at 11:57 AM Jonathan Nieder wrote: > > Hi, > > Brandon Williams wrote: > > On 08/09, Jeff King wrote: > > >> One interesting thing about url-encoding is that it's not one-to-one. > >> This case could also be %2F, which is a different file (on a > >> case-sensitive filesystem). I think "%20" and "+" are similarly > >> interchangeable. > >> > >> If we were decoding the filenames, that's fine. The round-trip is > >> lossless. > >> > >> But that's not quite how the new code behaves. We encode the input and > >> then check to see if it matches an encoding we previously performed. So > >> if our urlencode routines ever change, this will subtly break. > >> > >> I don't know how much it's worth caring about. We're not that likely to > >> change the routines ourself (though certainly a third-party > >> implementation would need to know our exact url-encoding decisions). > > > > This is exactly the reason why I wanted to get some opinions on what the > > best thing to do here would be. I _think_ the best thing would probably > > be to write a specific routine to do the conversion, and it wouldn't > > even have to be all that complex. Basically I'm just interested in > > converting '/' characters so that things no longer behave like > > nested directories. > > First of all, I think the behavior with this patch is already much > better than the previous status quo. I'm using the patch now and am > very happy with it. > > Second, what if we store the pathname in config? We already store the > URL there: > > [submodule "plugins/hooks"] > url = https://gerrit.googlesource.com/plugins/hooks > > So we could (as a followup patch) do something like > > [submodule "plugins/hooks"] > url = https://gerrit.googlesource.com/plugins/hooks > gitdirname = plugins%2fhooks > > and use that for lookups instead of regenerating the directory name. > What do you think? As I just looked at worktree code, this sounds intriguing for the wrong reason (again), as a user may want to point the gitdirname to a repository that they have already on disk outside the actual superproject. They would be reinventing worktrees in the submodule space. ;-) This would open up the security hole that we just had, again. So we'd have to make sure that the gitdirname (instead of the now meaningless subsection name) is proof to ../ attacks. I feel uneasy about this as then the user might come in and move submodules and repoint the gitdirname... to a not url encoded path. Exposing this knob just asks for trouble, no? On the other hand, the only requirement for the "name" is now uniqueness, and that is implied with subsections, so I guess it looks elegant. What would happen if gitdirname is changed as part of history? (The same problem we have now with changing the subsection name) The more I think about it the less appealing this is, but it looks elegant. Stefan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Tue, Aug 14, 2018 at 11:04:06AM -0700, Brandon Williams wrote: > > I think this backwards-compatibility is necessary to avoid pain. But > > until it goes away, I don't think this is helping the vulnerability from > > 0383bbb901. Because there the issue was that the submodule name pointed > > back into the working tree, so this access() would find the untrusted > > working tree code and say "ah, an old-fashioned name!". > [...] > > Oh I know that this doesn't help with that vulnerability. As you've > said we fix it and now disallow ".." at the submodule-config level so > really this path is simply about using what we get out of > submodule-config in a more sane manor. OK, I'm alright with that as long as we are all on the same page. I think I mistook "this addresses the vulnerability" from your commit message the wrong way. I took it as "this patch", but reading it again, you simply mean "the '..' handling we already did". I do think eventually dropping this back-compatibility could save us from another directory-escape problem, but it's hard to justify the real-world pain for a hypothetical benefit. Maybe in a few years we could get rid of it in a major version bump. > > One interesting thing about url-encoding is that it's not one-to-one. > > This case could also be %2F, which is a different file (on a > > case-sensitive filesystem). I think "%20" and "+" are similarly > > interchangeable. > > > > If we were decoding the filenames, that's fine. The round-trip is > > lossless. > > > > But that's not quite how the new code behaves. We encode the input and > > then check to see if it matches an encoding we previously performed. So > > if our urlencode routines ever change, this will subtly break. > > > > I don't know how much it's worth caring about. We're not that likely to > > change the routines ourself (though certainly a third-party > > implementation would need to know our exact url-encoding decisions). > > This is exactly the reason why I wanted to get some opinions on what the > best thing to do here would be. I _think_ the best thing would probably > be to write a specific routine to do the conversion, and it wouldn't > even have to be all that complex. Basically I'm just interested in > converting '/' characters so that things no longer behave like > nested directories. I think we benefit from catching names that would trigger filesystem case-folding, too. If I have submodules with names "foo" and "FOO", we would not want to confuse them (or at least we should confuse them equally on all platforms). I doubt you can do anything malicious, but it might simply be annoying. That implies to me using a custom function (even if its encoded form ends up being understandable as url-encoding). -Peff
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
Hi, Brandon Williams wrote: > On 08/09, Jeff King wrote: >> One interesting thing about url-encoding is that it's not one-to-one. >> This case could also be %2F, which is a different file (on a >> case-sensitive filesystem). I think "%20" and "+" are similarly >> interchangeable. >> >> If we were decoding the filenames, that's fine. The round-trip is >> lossless. >> >> But that's not quite how the new code behaves. We encode the input and >> then check to see if it matches an encoding we previously performed. So >> if our urlencode routines ever change, this will subtly break. >> >> I don't know how much it's worth caring about. We're not that likely to >> change the routines ourself (though certainly a third-party >> implementation would need to know our exact url-encoding decisions). > > This is exactly the reason why I wanted to get some opinions on what the > best thing to do here would be. I _think_ the best thing would probably > be to write a specific routine to do the conversion, and it wouldn't > even have to be all that complex. Basically I'm just interested in > converting '/' characters so that things no longer behave like > nested directories. First of all, I think the behavior with this patch is already much better than the previous status quo. I'm using the patch now and am very happy with it. Second, what if we store the pathname in config? We already store the URL there: [submodule "plugins/hooks"] url = https://gerrit.googlesource.com/plugins/hooks So we could (as a followup patch) do something like [submodule "plugins/hooks"] url = https://gerrit.googlesource.com/plugins/hooks gitdirname = plugins%2fhooks and use that for lookups instead of regenerating the directory name. What do you think? Thanks, Jonathan
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On 08/09, Jeff King wrote: > On Wed, Aug 08, 2018 at 03:33:23PM -0700, Brandon Williams wrote: > > > Commit 0383bbb901 (submodule-config: verify submodule names as paths, > > 2018-04-30) introduced some checks to ensure that submodule names don't > > include directory traversal components (e.g. "../"). > > > > This addresses the vulnerability identified in 0383bbb901 but the root > > cause is that we use submodule names to construct paths to the > > submodule's git directory. What we really should do is munge the > > submodule name before using it to construct a path. > > > > Teach "submodule_name_to_gitdir()" to munge a submodule's name (by url > > encoding it) before using it to build a path to the submodule's gitdir. > > I like this approach very much, and I think using url encoding is much > better than an opaque hash (purely because it makes debugging and > inspection saner). > > Two thoughts, though: > > > + modules_len = buf->len; > > strbuf_addstr(buf, submodule_name); > > + > > + /* > > +* If the submodule gitdir already exists using the old-fashioned > > +* location (which uses the submodule name as-is, without munging it) > > +* then return that. > > +*/ > > + if (!access(buf->buf, F_OK)) > > + return; > > I think this backwards-compatibility is necessary to avoid pain. But > until it goes away, I don't think this is helping the vulnerability from > 0383bbb901. Because there the issue was that the submodule name pointed > back into the working tree, so this access() would find the untrusted > working tree code and say "ah, an old-fashioned name!". > > In theory a fresh clone could set a config option for "I only speak > use new-style modules". And there could even be a conversion program > that moves the modules as appropriate, fixes up the .git files in the > working tree, and then sets that config. > > In fact, I think that config option _could_ be done by bumping > core.repositoryformatversion and then setting extensions.submodulenames > to "url" or something. Then you could never run into the confusing case > where you have a clone done by a new version of git (using new-style > names), but using an old-style version gets confused because it can't > find the module directories (instead, it would barf and say "I don't > know about that extension"). > > I don't know if any of that is worth it, though. We already fixed the > problem from 0383bbb901. There may be a _different_ "break out of the > modules directory" vulnerability, but since we disallow ".." it's hard > to see what it would be (the best I could come up with is maybe pointing > one module into the interior of another module, but I think you'd have > to trouble overwriting anything useful). > > And while an old-style version of Git being confused might be annoying, > I suspect that bumping the repository version would be even _more_ > annoying, because it would hit every command, not just ones that try to > touch those submodules. Oh I know that this doesn't help with that vulnerability. As you've said we fix it and now disallow ".." at the submodule-config level so really this path is simply about using what we get out of submodule-config in a more sane manor. > > > diff --git a/t/t7400-submodule-basic.sh b/t/t7400-submodule-basic.sh > > index 2c2c97e144..963693332c 100755 > > --- a/t/t7400-submodule-basic.sh > > +++ b/t/t7400-submodule-basic.sh > > @@ -933,7 +933,7 @@ test_expect_success 'recursive relative submodules stay > > relative' ' > > cd clone2 && > > git submodule update --init --recursive && > > echo "gitdir: ../.git/modules/sub3" >./sub3/.git_expect && > > - echo "gitdir: ../../../.git/modules/sub3/modules/dirdir/subsub" > > >./sub3/dirdir/subsub/.git_expect > > + echo "gitdir: > > ../../../.git/modules/sub3/modules/dirdir%2fsubsub" > > >./sub3/dirdir/subsub/.git_expect > > One interesting thing about url-encoding is that it's not one-to-one. > This case could also be %2F, which is a different file (on a > case-sensitive filesystem). I think "%20" and "+" are similarly > interchangeable. > > If we were decoding the filenames, that's fine. The round-trip is > lossless. > > But that's not quite how the new code behaves. We encode the input and > then check to see if it matches an encoding we previously performed. So > if our urlencode routines ever change, this will subtly break. > > I don't know how much it's worth caring about. We're not that likely to > change the routines ourself (though certainly a third-party > implementation would need to know our exact url-encoding decisions). This is exactly the reason why I wanted to get some opinions on what the best thing to do here would be. I _think_ the best thing would probably be to write a specific routine to do the conversion, and it wouldn't even have to be all that complex. Basically I'm just interested in converting '/' characters so that things no
Re: [PATCH 2/2] submodule: munge paths to submodule git directories
On Wed, Aug 08, 2018 at 03:33:23PM -0700, Brandon Williams wrote: > Commit 0383bbb901 (submodule-config: verify submodule names as paths, > 2018-04-30) introduced some checks to ensure that submodule names don't > include directory traversal components (e.g. "../"). > > This addresses the vulnerability identified in 0383bbb901 but the root > cause is that we use submodule names to construct paths to the > submodule's git directory. What we really should do is munge the > submodule name before using it to construct a path. > > Teach "submodule_name_to_gitdir()" to munge a submodule's name (by url > encoding it) before using it to build a path to the submodule's gitdir. I like this approach very much, and I think using url encoding is much better than an opaque hash (purely because it makes debugging and inspection saner). Two thoughts, though: > + modules_len = buf->len; > strbuf_addstr(buf, submodule_name); > + > + /* > + * If the submodule gitdir already exists using the old-fashioned > + * location (which uses the submodule name as-is, without munging it) > + * then return that. > + */ > + if (!access(buf->buf, F_OK)) > + return; I think this backwards-compatibility is necessary to avoid pain. But until it goes away, I don't think this is helping the vulnerability from 0383bbb901. Because there the issue was that the submodule name pointed back into the working tree, so this access() would find the untrusted working tree code and say "ah, an old-fashioned name!". In theory a fresh clone could set a config option for "I only speak use new-style modules". And there could even be a conversion program that moves the modules as appropriate, fixes up the .git files in the working tree, and then sets that config. In fact, I think that config option _could_ be done by bumping core.repositoryformatversion and then setting extensions.submodulenames to "url" or something. Then you could never run into the confusing case where you have a clone done by a new version of git (using new-style names), but using an old-style version gets confused because it can't find the module directories (instead, it would barf and say "I don't know about that extension"). I don't know if any of that is worth it, though. We already fixed the problem from 0383bbb901. There may be a _different_ "break out of the modules directory" vulnerability, but since we disallow ".." it's hard to see what it would be (the best I could come up with is maybe pointing one module into the interior of another module, but I think you'd have to trouble overwriting anything useful). And while an old-style version of Git being confused might be annoying, I suspect that bumping the repository version would be even _more_ annoying, because it would hit every command, not just ones that try to touch those submodules. > diff --git a/t/t7400-submodule-basic.sh b/t/t7400-submodule-basic.sh > index 2c2c97e144..963693332c 100755 > --- a/t/t7400-submodule-basic.sh > +++ b/t/t7400-submodule-basic.sh > @@ -933,7 +933,7 @@ test_expect_success 'recursive relative submodules stay > relative' ' > cd clone2 && > git submodule update --init --recursive && > echo "gitdir: ../.git/modules/sub3" >./sub3/.git_expect && > - echo "gitdir: ../../../.git/modules/sub3/modules/dirdir/subsub" > >./sub3/dirdir/subsub/.git_expect > + echo "gitdir: > ../../../.git/modules/sub3/modules/dirdir%2fsubsub" > >./sub3/dirdir/subsub/.git_expect One interesting thing about url-encoding is that it's not one-to-one. This case could also be %2F, which is a different file (on a case-sensitive filesystem). I think "%20" and "+" are similarly interchangeable. If we were decoding the filenames, that's fine. The round-trip is lossless. But that's not quite how the new code behaves. We encode the input and then check to see if it matches an encoding we previously performed. So if our urlencode routines ever change, this will subtly break. I don't know how much it's worth caring about. We're not that likely to change the routines ourself (though certainly a third-party implementation would need to know our exact url-encoding decisions). Some possible actions: 0. Do nothing, and cross our fingers. ;) 1. Don't use strbuf_addstr_urlencode(), but rather our own munging function which we know will remain stable (or alternatively, a flag to strbuf_addstr_urlencode to get the consistent behavior). 2. Make sure we have tests which cover this, so at least somebody changing the urlencode decisions will see a breakage. Your test here covers the upper/lowercase one, but we might want one that covers "+". (There may be more ambiguous cases, but those are the ones I know about). 3. Rather than check for the existence of names, decode what's actually in the modules/ directory to create an in-memory index of names. I hesitate to suggest that,