Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Jeff King
On Wed, Aug 29, 2018 at 02:09:13PM -0700, Jonathan Nieder wrote:

> Jeff King wrote:
> > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote:
> 
> >> Yeah, then let's just convert '/' with as little overhead as possible.
> >
> > Do you care about case-folding issues (e.g., submodules "FOO" and "foo"
> > colliding)?
> >
> > I'm OK if the answer is "no", but if you do want to deal with it, the
> > time is probably now.
> 
> Have we rejected the config approach?  I really liked the attribute of
> not having to solve everything right away.  I'm getting scared that
> we've forgotten that goal.

I personally have no problem with that approach, but I also haven't
thought that hard about it (I was mostly ignoring the discussion since
it seemed like submodule-interested folks, but I happened to see what
looked like a potentially bad idea cc'd to me ;) ).

-Peff


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Jeff King
On Wed, Aug 29, 2018 at 02:10:37PM -0700, Stefan Beller wrote:

> > Yes, that makes even the capitalized "CON" issues go away. It's not a
> > one-to-one mapping, though ("foo-" and "foo_" map to the same entity).
> 
> foo_ would map to foo__, and foo- would map to something else.
> (foo- as we do not rewrite dashes, yet?)

Ah, OK, I took your:

>   [A-Z]  -> _[a-z]

to mean "A-Z becomes a-z, and everything else becomes underscore".

If you mean a real one-to-one mapping that allows a-z and only a few
safe metacharacters, then yeah, that's what I was thinking, too.

> > If we want that, too, I think something like url-encoding is fine, with
> > the caveat that we simply urlencode _more_ things (i.e., anything not in
> > [a-z_]).
> 
> Yeah I think we need more than url encoding now.

If you take "url encoding" to only be the mechanical transformation of
quoting, not the set of _what_ gets quoting, we can still stick with it.
We don't need to, but it's probably no worse than inventing our own
set of quoting rules.

-Peff


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Stefan Beller
On Wed, Aug 29, 2018 at 2:18 PM Jonathan Nieder  wrote:
>
> Hi,
>
> Stefan Beller wrote:
>
> >> Yes, that makes even the capitalized "CON" issues go away. It's not a
> >> one-to-one mapping, though ("foo-" and "foo_" map to the same entity).
> >
> > foo_ would map to foo__, and foo- would map to something else.
> > (foo- as we do not rewrite dashes, yet?)
> >
> >> If we want that, too, I think something like url-encoding is fine, with
> >> the caveat that we simply urlencode _more_ things (i.e., anything not in
> >> [a-z_]).
> >
> > Yeah I think we need more than url encoding now.
>
> Can you say more?

https://public-inbox.org/git/CAGZ79kZv4BjRq=kq_1UeT2Kn38OZwYFgnMsTe6X_WP41=hb...@mail.gmail.com/

> Can you spell out for me what problem we're solving with something more 
> custom?

case sensitivity for example.

> the ability to tweak things later.

That is unrelated to the choice of encoding, but more related to
https://public-inbox.org/git/20180816181940.46114-1-bmw...@google.com/


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Brandon Williams
On 08/29, Stefan Beller wrote:
> On Wed, Aug 29, 2018 at 2:09 PM Jonathan Nieder  wrote:
> >
> > Jeff King wrote:
> > > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote:
> >
> > >> Yeah, then let's just convert '/' with as little overhead as possible.
> > >
> > > Do you care about case-folding issues (e.g., submodules "FOO" and "foo"
> > > colliding)?
> > >
> > > I'm OK if the answer is "no", but if you do want to deal with it, the
> > > time is probably now.
> >
> > Have we rejected the config approach?
> 
> I did not reject that approach, but am rather waiting for patches. ;-)

Note I did send out a patch using this approach, so no need to wait any
longer! :D

https://public-inbox.org/git/20180816181940.46114-1-bmw...@google.com/

-- 
Brandon Williams


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Jonathan Nieder
Hi,

Stefan Beller wrote:

>> Yes, that makes even the capitalized "CON" issues go away. It's not a
>> one-to-one mapping, though ("foo-" and "foo_" map to the same entity).
>
> foo_ would map to foo__, and foo- would map to something else.
> (foo- as we do not rewrite dashes, yet?)
>
>> If we want that, too, I think something like url-encoding is fine, with
>> the caveat that we simply urlencode _more_ things (i.e., anything not in
>> [a-z_]).
>
> Yeah I think we need more than url encoding now.

Can you say more?  Perhaps my expectations have been poisoned by tools
like dpkg-buildpackage that use urlencode.  As far as I can tell, it
works fine.

Moreover, urlencode has some attributes that make it a good potential
fit: it's intuitive, it's unambiguous (yes, it's one-to-many, but at
least it's not many-to-many), and people know how to deal with it from
their lives using browsers.  Can you spell out for me what problem
we're solving with something more custom?

Stepping back, I am very worried about any design that doesn't give us
the ability to tweak things later.  See [1] and [2] for more on that
subject.

Thanks,
Jonathan

[1] 
https://public-inbox.org/git/20180816023446.ga127...@aiede.svl.corp.google.com/
[2] 
https://public-inbox.org/git/20180829210913.gf7...@aiede.svl.corp.google.com/


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Stefan Beller
On Wed, Aug 29, 2018 at 2:09 PM Jonathan Nieder  wrote:
>
> Jeff King wrote:
> > On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote:
>
> >> Yeah, then let's just convert '/' with as little overhead as possible.
> >
> > Do you care about case-folding issues (e.g., submodules "FOO" and "foo"
> > colliding)?
> >
> > I'm OK if the answer is "no", but if you do want to deal with it, the
> > time is probably now.
>
> Have we rejected the config approach?

I did not reject that approach, but am rather waiting for patches. ;-)

> I really liked the attribute of
> not having to solve everything right away.  I'm getting scared that
> we've forgotten that goal.

Eh, sorry for side tracking this issue.

I am just under the impression that the URL encoding is not particularly
good for our use case as it solves just one out of many things, whereas
the one thing (having no slashes) can also be solved in an easier way.

> It mixes well with Stefan's idea of setting up a new .git/submodules/
> directory.  We could require that everything in .git/submodules/ have
> configuration (or that everything in that directory either have
> configuration or be the result of a "very simple" transformation) and
> that way, all ambiguity goes away.

I would not want to have a world where we require that config, but I
would agree to the latter, hence we would need to discuss "very simple".
I guess that are 2 or 3 rules at most.

> Part of the definition of "very simple" could be that the submodule
> name must consist of some whitelisted list of characters (including no
> uppercase), for example.

Good catch!

Thanks for chiming in,
Stefan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Stefan Beller
> Yes, that makes even the capitalized "CON" issues go away. It's not a
> one-to-one mapping, though ("foo-" and "foo_" map to the same entity).

foo_ would map to foo__, and foo- would map to something else.
(foo- as we do not rewrite dashes, yet?)

>
> If we want that, too, I think something like url-encoding is fine, with
> the caveat that we simply urlencode _more_ things (i.e., anything not in
> [a-z_]).

Yeah I think we need more than url encoding now.


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Jonathan Nieder
Jeff King wrote:
> On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote:

>> Yeah, then let's just convert '/' with as little overhead as possible.
>
> Do you care about case-folding issues (e.g., submodules "FOO" and "foo"
> colliding)?
>
> I'm OK if the answer is "no", but if you do want to deal with it, the
> time is probably now.

Have we rejected the config approach?  I really liked the attribute of
not having to solve everything right away.  I'm getting scared that
we've forgotten that goal.

It mixes well with Stefan's idea of setting up a new .git/submodules/
directory.  We could require that everything in .git/submodules/ have
configuration (or that everything in that directory either have
configuration or be the result of a "very simple" transformation) and
that way, all ambiguity goes away.

Part of the definition of "very simple" could be that the submodule
name must consist of some whitelisted list of characters (including no
uppercase), for example.

Thanks,
Jonathan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Jeff King
On Wed, Aug 29, 2018 at 11:10:51AM -0700, Stefan Beller wrote:

> > Do you care about case-folding issues (e.g., submodules "FOO" and "foo"
> > colliding)?
> 
> I do. :(
> 
> 2d84f13dcb6 (config: fix case sensitive subsection names on writing, 
> 2018-08-08)
> explains the latest episode of case folding with submodules involved.
> 
> > I'm OK if the answer is "no", but if you do want to deal with it, the
> > time is probably now.
> 
> Good point. But as soon as we start discussing case sensitivity, we
> are drawn down the rabbit hole of funny file names. (Try naming
> a submodule "CON1" and obtain it on Windows for example)
> So we would need to have a file system specific encoding function for
> submodule names, which sounds like a maintenance night mare.

Hmph. I'd hoped that simply escaping metacharacters and doing some
obvious case-folding would be enough. And I think that would cover most
accidental cases. But yeah, Windows reserved names are basically
indistinguishable from reasonable names. They'd probably need
special-cased.

OTOH, I'm not sure how we handle those for entries in the actual tree.
Poking around git-for-windows/git, I think it uses the magic "\\?"
marker to tell the OS to interpret the name literally.

So I wonder if it might be sufficient to just deal with the more obvious
folding issues. Or as you noted, if we just choose lowercase names as
the normalized form, that might also be enough. :)

> So if I was thinking in the scheme presented above, we could just
> have another rule that is
> 
>   [A-Z]  -> _[a-z]
> 
> (lowercase capital letters and escape them with an underscore)

Yes, that makes even the capitalized "CON" issues go away. It's not a
one-to-one mapping, though ("foo-" and "foo_" map to the same entity).

If we want that, too, I think something like url-encoding is fine, with
the caveat that we simply urlencode _more_ things (i.e., anything not in
[a-z_]).

-Peff


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-29 Thread Stefan Beller
On Tue, Aug 28, 2018 at 10:25 PM Jeff King  wrote:
>
> On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote:
>
> > 3) (optional) instead of putting it all in modules/, use another
> >directory gitmodules/ for example. this will make sure we can tell
> >if a repository has been converted or is stuck with a setup of a
> >current git.
>
> I actually kind of like that idea, as it makes the interaction between
> old and new names much simpler to reason about.
>
> And since old code won't know about the new names anyway, there's in
> theory no downside. In practice, of course, the encoding may often be a
> noop, and lazy scripts would continue to work most of the time if you
> didn't change out the prefix directory. I'm not sure if that is an
> argument for the scheme (because it will suss out broken scripts more
> consistently) or against it (because 99% of the time those old scripts
> would just happen to work).
>
> > > This is exactly the reason why I wanted to get some opinions on what the
> > > best thing to do here would be.  I _think_ the best thing would probably
> > > be to write a specific routine to do the conversion, and it wouldn't
> > > even have to be all that complex.  Basically I'm just interested in
> > > converting '/' characters so that things no longer behave like
> > > nested directories.
> >
> > Yeah, then let's just convert '/' with as little overhead as possible.
>
> Do you care about case-folding issues (e.g., submodules "FOO" and "foo"
> colliding)?

I do. :(

2d84f13dcb6 (config: fix case sensitive subsection names on writing, 2018-08-08)
explains the latest episode of case folding with submodules involved.

> I'm OK if the answer is "no", but if you do want to deal with it, the
> time is probably now.

Good point. But as soon as we start discussing case sensitivity, we
are drawn down the rabbit hole of funny file names. (Try naming
a submodule "CON1" and obtain it on Windows for example)
So we would need to have a file system specific encoding function for
submodule names, which sounds like a maintenance night mare.

The CON1 example shows that URL encoding may not be enough
on Windows and we'd have to extend the encoding if we care about
FS issues.

Another example would be "a" and "a\b" which would be a mess
in Windows as the '\' would work as a dir separator whereas these
two names were ok on linux. This would be fixed with url encoding.

URL encoding would not fix the case-folding issue that you
mentioned above.

So if I was thinking in the scheme presented above, we could just
have another rule that is

  [A-Z]  -> _[a-z]

(lowercase capital letters and escape them with an underscore)

But with that rule added, we are inventing a really complicated
encoding scheme already.


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-28 Thread Jeff King
On Tue, Aug 28, 2018 at 02:35:25PM -0700, Stefan Beller wrote:

> 3) (optional) instead of putting it all in modules/, use another
>directory gitmodules/ for example. this will make sure we can tell
>if a repository has been converted or is stuck with a setup of a
>current git.

I actually kind of like that idea, as it makes the interaction between
old and new names much simpler to reason about.

And since old code won't know about the new names anyway, there's in
theory no downside. In practice, of course, the encoding may often be a
noop, and lazy scripts would continue to work most of the time if you
didn't change out the prefix directory. I'm not sure if that is an
argument for the scheme (because it will suss out broken scripts more
consistently) or against it (because 99% of the time those old scripts
would just happen to work).

> > This is exactly the reason why I wanted to get some opinions on what the
> > best thing to do here would be.  I _think_ the best thing would probably
> > be to write a specific routine to do the conversion, and it wouldn't
> > even have to be all that complex.  Basically I'm just interested in
> > converting '/' characters so that things no longer behave like
> > nested directories.
> 
> Yeah, then let's just convert '/' with as little overhead as possible.

Do you care about case-folding issues (e.g., submodules "FOO" and "foo"
colliding)?

I'm OK if the answer is "no", but if you do want to deal with it, the
time is probably now.

-Peff


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-28 Thread Stefan Beller
> > > -   echo "gitdir: 
> > > ../../../.git/modules/sub3/modules/dirdir/subsub" 
> > > >./sub3/dirdir/subsub/.git_expect
> > > +   echo "gitdir: 
> > > ../../../.git/modules/sub3/modules/dirdir%2fsubsub" 
> > > >./sub3/dirdir/subsub/.git_expect
> >
> > One interesting thing about url-encoding is that it's not one-to-one.
> > This case could also be %2F, which is a different file (on a
> > case-sensitive filesystem). I think "%20" and "+" are similarly
> > interchangeable.
> >
> > If we were decoding the filenames, that's fine. The round-trip is
> > lossless.
> >
> > But that's not quite how the new code behaves. We encode the input and
> > then check to see if it matches an encoding we previously performed. So
> > if our urlencode routines ever change, this will subtly break.

And this is the problem:
a) we have a 'complicated' encoding here, which must never change
b) the "encode and check if it matches", will produce ugly code going forward,
as it tries to differentiate between submodules named "url_encoded(a)"
and "a" (e.g. "a%20b" and "a b" would conflict and we have to resolve
the conflict, although those two names are perfectly fine as they do not
have the original problem of having slashes)

Hence I would propose a simpler encoding:

1)/ -> _ ( replace a slash by an underscore)
2)_ -> __ (replace any underscore by 2 underscores, this is just the
  escaping mechanism to differentiate a/b and a_b)

3) (optional) instead of putting it all in modules/, use another
directory gitmodules/
for example. this will make sure we can tell if a repository has
been converted
or is stuck with a setup of a current git.

> This is exactly the reason why I wanted to get some opinions on what the
> best thing to do here would be.  I _think_ the best thing would probably
> be to write a specific routine to do the conversion, and it wouldn't
> even have to be all that complex.  Basically I'm just interested in
> converting '/' characters so that things no longer behave like
> nested directories.

Yeah, then let's just convert '/' with as little overhead as possible.

Thanks,
Stefan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-16 Thread Brandon Williams
On 08/15, Jonathan Nieder wrote:
> Stefan Beller wrote:
> > Jonathan Nieder wrote:
> 
> >> All at the cost of recording a little configuration somewhere.  If we
> >> want to decrease the configuration, we can avoid recording it there in
> >> the easy cases (e.g. when name == gitdirname).  That's "just" an
> >> optimization.
> >
> > Sounds good, but gerrit for example would not take advantage of such
> > optimisation as they have slashes in their submodules. :-(
> > I wonder if we can optimize further and keep slashes if there is
> > no conflict (as then name == gitdirname, so it can be optimized).
> 
> One possibility would be to treat gsub("/", "%2f") as another of the
> easy cases.
> 
> [...]
> >> And then we have the ability later to handle all the edge cases we
> >> haven't handled yet today:
> >>
> >> - sharding when the number of submodules is too large
> >> - case-insensitive filesystems
> >> - path name length limits
> >> - different sets of filesystem-special characters
> >>
> >> Sane?

Seems like a sensible thing to do. Let me work up some patches to
implement this using config primarily and these other schemes as
fallbacks.

> >
> > I'll keep thinking about it.
> 
> Thanks.
> 
> > FYI: the reduction in configuration was just sent out.
> 
> https://public-inbox.org/git/20180816023100.161626-1-sbel...@google.com/
> for those following along.
> 
> Ciao,
> Jonathan

-- 
Brandon Williams


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-16 Thread Junio C Hamano
Jonathan Nieder  writes:

> So it would be nice, for future-proofing, if we can change the naming
> scheme later.
> ...
> All at the cost of recording a little configuration somewhere.  If we
> want to decrease the configuration, we can avoid recording it there in
> the easy cases (e.g. when name == gitdirname).  That's "just" an
> optimization.
>
> And then we have the ability later to handle all the edge cases we
> haven't handled yet today:
>
> - sharding when the number of submodules is too large
> - case-insensitive filesystems
> - path name length limits
> - different sets of filesystem-special characters
>
> Sane?

Yup.



Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-15 Thread Jonathan Nieder
Stefan Beller wrote:
> Jonathan Nieder wrote:

>> All at the cost of recording a little configuration somewhere.  If we
>> want to decrease the configuration, we can avoid recording it there in
>> the easy cases (e.g. when name == gitdirname).  That's "just" an
>> optimization.
>
> Sounds good, but gerrit for example would not take advantage of such
> optimisation as they have slashes in their submodules. :-(
> I wonder if we can optimize further and keep slashes if there is
> no conflict (as then name == gitdirname, so it can be optimized).

One possibility would be to treat gsub("/", "%2f") as another of the
easy cases.

[...]
>> And then we have the ability later to handle all the edge cases we
>> haven't handled yet today:
>>
>> - sharding when the number of submodules is too large
>> - case-insensitive filesystems
>> - path name length limits
>> - different sets of filesystem-special characters
>>
>> Sane?
>
> I'll keep thinking about it.

Thanks.

> FYI: the reduction in configuration was just sent out.

https://public-inbox.org/git/20180816023100.161626-1-sbel...@google.com/
for those following along.

Ciao,
Jonathan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-15 Thread Stefan Beller
> [...]

all good reasons; ship it :-)

> All at the cost of recording a little configuration somewhere.  If we
> want to decrease the configuration, we can avoid recording it there in
> the easy cases (e.g. when name == gitdirname).  That's "just" an
> optimization.

Sounds good, but gerrit for example would not take advantage of such
optimisation as they have slashes in their submodules. :-(
I wonder if we can optimize further and keep slashes if there is
no conflict (as then name == gitdirname, so it can be optimized).

> And then we have the ability later to handle all the edge cases we
> haven't handled yet today:
>
> - sharding when the number of submodules is too large
> - case-insensitive filesystems
> - path name length limits
> - different sets of filesystem-special characters
>
> Sane?

I'll keep thinking about it.

FYI: the reduction in configuration was just sent out.

Thanks,
Stefan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-15 Thread Jonathan Nieder
Hi again,

Stefan Beller wrote:
> On Tue, Aug 14, 2018 at 2:12 PM Jonathan Nieder  wrote:

>> What if we forbid directory separator characters in the gitdirname?
>
> Fine with me, but ideally we'd want to allow sharding the
> submodules. When you have 1000 submodules
> we'd want them not all inside the toplevel "modules/" ?

That's a good reason to permit slashes in the gitdirname.

If I understood the rest of your reply correctly, your worry was about
dangerous gitdirname values in .gitmodules.  I never had any wish to
read them from there anyway, so this worry hopefully goes away.

[...]
>> In this proposal, it would only be read from config, not from
>> .gitmodules.
>
> Ah good point. That makes sense.
>
> Stepping back a bit regarding the config:
[...]
> Now that we have the submodule.active or even
> submodule..active flags, we do not need (b) any more.
> So the URL turns into a useless piece of cruft that just is unneeded
> and might confuse the user.
>
> So maybe I'd want to propose a patch that removes
> submodule..url from the config once it is cloned.
> (I just read up on "submodule sync" again, but that might not
> even need special care for this new world)
>
> And with all that said, I think if we can avoid having the submodules
> gitdir in the config, the config would look much cleaner, too.

Yes, I understand and agree with this.

I should further spell out my motivation with this gitdirname
suggestion.  The issue that some people have mentioned in this thread
is that urlencoding might not be perfect --- it's pretty close to
perfect, but it's likely we'll come up with some unanticipated needs
later (like sharding) that it doesn't solve.  Solving those all right
now would not necessarily be wise, since the thing about unanticipated
needs is that you never know in advance what they will be. ;-)

So it would be nice, for future-proofing, if we can change the naming
scheme later.

As a bonus, that would also make interoperability with other
implementations easier.  For example, suppose we mess up in JGit and
urlencode a different set of characters than Git does.  Then a mixed
Git + JGit installation would have this subtle bug of the submodule
.git directory not being reused when I switch to and from and branch
not containing that submodule, in some circumstances.  That sounds
difficult to support.

Whereas if we have a gitdirname configuration variable, then JGit and
libgit2 and go-git do not have to match the naming scheme Git chooses.
They can try, but if one gets it subtly wrong then that is okay
because the submodule's directory name is right there and easy to look
up.

All at the cost of recording a little configuration somewhere.  If we
want to decrease the configuration, we can avoid recording it there in
the easy cases (e.g. when name == gitdirname).  That's "just" an
optimization.

And then we have the ability later to handle all the edge cases we
haven't handled yet today:

- sharding when the number of submodules is too large
- case-insensitive filesystems
- path name length limits
- different sets of filesystem-special characters

Sane?

Thanks,
Jonathan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-15 Thread Aaron Schrab

At 15:33 -0700 08 Aug 2018, Brandon Williams  wrote:

Teach "submodule_name_to_gitdir()" to munge a submodule's name (by url
encoding it) before using it to build a path to the submodule's gitdir.


Seems like this will be a problem if it results in names that exceed 
NAME_MAX? On common systems that's 255, so it's probably not going to be 
common; but it certainly could for some repositories.


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-14 Thread Stefan Beller
On Tue, Aug 14, 2018 at 2:12 PM Jonathan Nieder  wrote:
>
> Hi,
>
> Stefan Beller wrote:
> > On Tue, Aug 14, 2018 at 11:57 AM Jonathan Nieder  wrote:
>
> >> Second, what if we store the pathname in config?  We already store the
> >> URL there:
> >>
> >> [submodule "plugins/hooks"]
> >> url = https://gerrit.googlesource.com/plugins/hooks
> >>
> >> So we could (as a followup patch) do something like
> >>
> >> [submodule "plugins/hooks"]
> >> url = https://gerrit.googlesource.com/plugins/hooks
> >> gitdirname = plugins%2fhooks
> >>
> >> and use that for lookups instead of regenerating the directory name.
> >> What do you think?
> >
> > As I just looked at worktree code, this sounds intriguing for the wrong
> > reason (again), as a user may want to point the gitdirname to a repository
> > that they have already on disk outside the actual superproject. They
> > would be reinventing worktrees in the submodule space. ;-)
> >
> > This would open up the security hole that we just had, again.
> > So we'd have to make sure that the gitdirname (instead of the
> > now meaningless subsection name) is proof to ../ attacks.
> >
> > I feel uneasy about this as then the user might come in
> > and move submodules and repoint the gitdirname...
> > to a not url encoded path. Exposing this knob just
> > asks for trouble, no?
>
> What if we forbid directory separator characters in the gitdirname?

Fine with me, but ideally we'd want to allow sharding the
submodules. When you have 1000 submodules
we'd want them not all inside the toplevel "modules/" ?
Up to now we could just wave hands and claim the user
(who is clearly experienced with submodules as they
use so many of them) would shard it properly.

With this scheme we loose the ability to shard.

> [...]
> > What would happen if gitdirname is changed as part of
> > history? (The same problem we have now with changing
> > the subsection name)
>
> In this proposal, it would only be read from config, not from
> .gitmodules.

Ah good point. That makes sense.

Stepping back a bit regarding the config:
When I clone gerrit (or any repo using submodules)

$ git clone --recurse-submodules \
  https://gerrit.googlesource.com/gerrit g2
[...]
$ cat g2/.git/config
[submodule]
active = .
[submodule "plugins/codemirror-editor"]
url = https://gerrit.googlesource.com/plugins/codemirror-editor
[... more urls to follow...]

Originally we have had the url in the config, (a) that we can change
the URLs after the "git submodule init" and "git submodule update"
step that actually clones the submodule if not present and much more
importantly (b) to know which submodule "was initialized/active".

Now that we have the submodule.active or even
submodule..active flags, we do not need (b) any more.
So the URL turns into a useless piece of cruft that just is unneeded
and might confuse the user.

So maybe I'd want to propose a patch that removes
submodule..url from the config once it is cloned.
(I just read up on "submodule sync" again, but that might not
even need special care for this new world)

And with all that said, I think if we can avoid having the submodules
gitdir in the config, the config would look much cleaner, too.

But maybe that is the wrong thing to optimize for. ;-)
It just demonstrates that we'd have a submodule specific
thing again in the config.

So my preference would be to do a similar thing as
url-encoding as that solves the issue of slashes and
potentially of case sensitivity (e.g. encode upper case A
as lower case with underscore _a)

However the transition worries me, as it transitions
within the same namespace. Back then when we
transferred from the .git dir inside the submodules
working tree to the embedded version in the superprojects
.git dir, there was no overlap, and any potential directory
in .git/modules/ that was already there, was highly
unusual, so asking the user for help is the reasonable
thing to do.
But now we might run into issues that has overlap between
old(name as is) and new (urlencoded) world.

So maybe we also want to transition from

modules/

to

submodules/)>

Thanks,
Stefan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-14 Thread Jonathan Nieder
Hi,

Stefan Beller wrote:
> On Tue, Aug 14, 2018 at 11:57 AM Jonathan Nieder  wrote:

>> Second, what if we store the pathname in config?  We already store the
>> URL there:
>>
>> [submodule "plugins/hooks"]
>> url = https://gerrit.googlesource.com/plugins/hooks
>>
>> So we could (as a followup patch) do something like
>>
>> [submodule "plugins/hooks"]
>> url = https://gerrit.googlesource.com/plugins/hooks
>> gitdirname = plugins%2fhooks
>>
>> and use that for lookups instead of regenerating the directory name.
>> What do you think?
>
> As I just looked at worktree code, this sounds intriguing for the wrong
> reason (again), as a user may want to point the gitdirname to a repository
> that they have already on disk outside the actual superproject. They
> would be reinventing worktrees in the submodule space. ;-)
>
> This would open up the security hole that we just had, again.
> So we'd have to make sure that the gitdirname (instead of the
> now meaningless subsection name) is proof to ../ attacks.
>
> I feel uneasy about this as then the user might come in
> and move submodules and repoint the gitdirname...
> to a not url encoded path. Exposing this knob just
> asks for trouble, no?

What if we forbid directory separator characters in the gitdirname?

[...]
> What would happen if gitdirname is changed as part of
> history? (The same problem we have now with changing
> the subsection name)

In this proposal, it would only be read from config, not from
.gitmodules.

Thanks,
Jonathan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-14 Thread Stefan Beller
On Tue, Aug 14, 2018 at 11:57 AM Jonathan Nieder  wrote:
>
> Hi,
>
> Brandon Williams wrote:
> > On 08/09, Jeff King wrote:
>
> >> One interesting thing about url-encoding is that it's not one-to-one.
> >> This case could also be %2F, which is a different file (on a
> >> case-sensitive filesystem). I think "%20" and "+" are similarly
> >> interchangeable.
> >>
> >> If we were decoding the filenames, that's fine. The round-trip is
> >> lossless.
> >>
> >> But that's not quite how the new code behaves. We encode the input and
> >> then check to see if it matches an encoding we previously performed. So
> >> if our urlencode routines ever change, this will subtly break.
> >>
> >> I don't know how much it's worth caring about. We're not that likely to
> >> change the routines ourself (though certainly a third-party
> >> implementation would need to know our exact url-encoding decisions).
> >
> > This is exactly the reason why I wanted to get some opinions on what the
> > best thing to do here would be.  I _think_ the best thing would probably
> > be to write a specific routine to do the conversion, and it wouldn't
> > even have to be all that complex.  Basically I'm just interested in
> > converting '/' characters so that things no longer behave like
> > nested directories.
>
> First of all, I think the behavior with this patch is already much
> better than the previous status quo.  I'm using the patch now and am
> very happy with it.
>
> Second, what if we store the pathname in config?  We already store the
> URL there:
>
> [submodule "plugins/hooks"]
> url = https://gerrit.googlesource.com/plugins/hooks
>
> So we could (as a followup patch) do something like
>
> [submodule "plugins/hooks"]
> url = https://gerrit.googlesource.com/plugins/hooks
> gitdirname = plugins%2fhooks
>
> and use that for lookups instead of regenerating the directory name.
> What do you think?

As I just looked at worktree code, this sounds intriguing for the wrong
reason (again), as a user may want to point the gitdirname to a repository
that they have already on disk outside the actual superproject. They
would be reinventing worktrees in the submodule space. ;-)

This would open up the security hole that we just had, again.
So we'd have to make sure that the gitdirname (instead of the
now meaningless subsection name) is proof to ../ attacks.

I feel uneasy about this as then the user might come in
and move submodules and repoint the gitdirname...
to a not url encoded path. Exposing this knob just
asks for trouble, no?

On the other hand, the only requirement for the "name" is
now uniqueness, and that is implied with subsections,
so I guess it looks elegant.

What would happen if gitdirname is changed as part of
history? (The same problem we have now with changing
the subsection name)

The more I think about it the less appealing this is, but it looks
elegant.

Stefan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-14 Thread Jeff King
On Tue, Aug 14, 2018 at 11:04:06AM -0700, Brandon Williams wrote:

> > I think this backwards-compatibility is necessary to avoid pain. But
> > until it goes away, I don't think this is helping the vulnerability from
> > 0383bbb901. Because there the issue was that the submodule name pointed
> > back into the working tree, so this access() would find the untrusted
> > working tree code and say "ah, an old-fashioned name!".
> [...]
> 
> Oh I know that this doesn't help with that vulnerability.  As you've
> said we fix it and now disallow ".." at the submodule-config level so
> really this path is simply about using what we get out of
> submodule-config in a more sane manor.

OK, I'm alright with that as long as we are all on the same page. I
think I mistook "this addresses the vulnerability" from your commit
message the wrong way. I took it as "this patch", but reading it again,
you simply mean "the '..' handling we already did".

I do think eventually dropping this back-compatibility could save us
from another directory-escape problem, but it's hard to justify the
real-world pain for a hypothetical benefit. Maybe in a few years we
could get rid of it in a major version bump.

> > One interesting thing about url-encoding is that it's not one-to-one.
> > This case could also be %2F, which is a different file (on a
> > case-sensitive filesystem). I think "%20" and "+" are similarly
> > interchangeable.
> > 
> > If we were decoding the filenames, that's fine. The round-trip is
> > lossless.
> > 
> > But that's not quite how the new code behaves. We encode the input and
> > then check to see if it matches an encoding we previously performed. So
> > if our urlencode routines ever change, this will subtly break.
> > 
> > I don't know how much it's worth caring about. We're not that likely to
> > change the routines ourself (though certainly a third-party
> > implementation would need to know our exact url-encoding decisions).
> 
> This is exactly the reason why I wanted to get some opinions on what the
> best thing to do here would be.  I _think_ the best thing would probably
> be to write a specific routine to do the conversion, and it wouldn't
> even have to be all that complex.  Basically I'm just interested in
> converting '/' characters so that things no longer behave like
> nested directories.

I think we benefit from catching names that would trigger filesystem
case-folding, too. If I have submodules with names "foo" and "FOO", we
would not want to confuse them (or at least we should confuse them
equally on all platforms). I doubt you can do anything malicious, but it
might simply be annoying.

That implies to me using a custom function (even if its encoded form
ends up being understandable as url-encoding).

-Peff


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-14 Thread Jonathan Nieder
Hi,

Brandon Williams wrote:
> On 08/09, Jeff King wrote:

>> One interesting thing about url-encoding is that it's not one-to-one.
>> This case could also be %2F, which is a different file (on a
>> case-sensitive filesystem). I think "%20" and "+" are similarly
>> interchangeable.
>>
>> If we were decoding the filenames, that's fine. The round-trip is
>> lossless.
>>
>> But that's not quite how the new code behaves. We encode the input and
>> then check to see if it matches an encoding we previously performed. So
>> if our urlencode routines ever change, this will subtly break.
>>
>> I don't know how much it's worth caring about. We're not that likely to
>> change the routines ourself (though certainly a third-party
>> implementation would need to know our exact url-encoding decisions).
>
> This is exactly the reason why I wanted to get some opinions on what the
> best thing to do here would be.  I _think_ the best thing would probably
> be to write a specific routine to do the conversion, and it wouldn't
> even have to be all that complex.  Basically I'm just interested in
> converting '/' characters so that things no longer behave like
> nested directories.

First of all, I think the behavior with this patch is already much
better than the previous status quo.  I'm using the patch now and am
very happy with it.

Second, what if we store the pathname in config?  We already store the
URL there:

[submodule "plugins/hooks"]
url = https://gerrit.googlesource.com/plugins/hooks

So we could (as a followup patch) do something like

[submodule "plugins/hooks"]
url = https://gerrit.googlesource.com/plugins/hooks
gitdirname = plugins%2fhooks

and use that for lookups instead of regenerating the directory name.
What do you think?

Thanks,
Jonathan


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-14 Thread Brandon Williams
On 08/09, Jeff King wrote:
> On Wed, Aug 08, 2018 at 03:33:23PM -0700, Brandon Williams wrote:
> 
> > Commit 0383bbb901 (submodule-config: verify submodule names as paths,
> > 2018-04-30) introduced some checks to ensure that submodule names don't
> > include directory traversal components (e.g. "../").
> > 
> > This addresses the vulnerability identified in 0383bbb901 but the root
> > cause is that we use submodule names to construct paths to the
> > submodule's git directory.  What we really should do is munge the
> > submodule name before using it to construct a path.
> > 
> > Teach "submodule_name_to_gitdir()" to munge a submodule's name (by url
> > encoding it) before using it to build a path to the submodule's gitdir.
> 
> I like this approach very much, and I think using url encoding is much
> better than an opaque hash (purely because it makes debugging and
> inspection saner).
> 
> Two thoughts, though:
> 
> > +   modules_len = buf->len;
> > strbuf_addstr(buf, submodule_name);
> > +
> > +   /*
> > +* If the submodule gitdir already exists using the old-fashioned
> > +* location (which uses the submodule name as-is, without munging it)
> > +* then return that.
> > +*/
> > +   if (!access(buf->buf, F_OK))
> > +   return;
> 
> I think this backwards-compatibility is necessary to avoid pain. But
> until it goes away, I don't think this is helping the vulnerability from
> 0383bbb901. Because there the issue was that the submodule name pointed
> back into the working tree, so this access() would find the untrusted
> working tree code and say "ah, an old-fashioned name!".
> 
> In theory a fresh clone could set a config option for "I only speak
> use new-style modules". And there could even be a conversion program
> that moves the modules as appropriate, fixes up the .git files in the
> working tree, and then sets that config.
> 
> In fact, I think that config option _could_ be done by bumping
> core.repositoryformatversion and then setting extensions.submodulenames
> to "url" or something. Then you could never run into the confusing case
> where you have a clone done by a new version of git (using new-style
> names), but using an old-style version gets confused because it can't
> find the module directories (instead, it would barf and say "I don't
> know about that extension").
> 
> I don't know if any of that is worth it, though. We already fixed the
> problem from 0383bbb901. There may be a _different_ "break out of the
> modules directory" vulnerability, but since we disallow ".." it's hard
> to see what it would be (the best I could come up with is maybe pointing
> one module into the interior of another module, but I think you'd have
> to trouble overwriting anything useful).
> 
> And while an old-style version of Git being confused might be annoying,
> I suspect that bumping the repository version would be even _more_
> annoying, because it would hit every command, not just ones that try to
> touch those submodules.

Oh I know that this doesn't help with that vulnerability.  As you've
said we fix it and now disallow ".." at the submodule-config level so
really this path is simply about using what we get out of
submodule-config in a more sane manor.

> 
> > diff --git a/t/t7400-submodule-basic.sh b/t/t7400-submodule-basic.sh
> > index 2c2c97e144..963693332c 100755
> > --- a/t/t7400-submodule-basic.sh
> > +++ b/t/t7400-submodule-basic.sh
> > @@ -933,7 +933,7 @@ test_expect_success 'recursive relative submodules stay 
> > relative' '
> > cd clone2 &&
> > git submodule update --init --recursive &&
> > echo "gitdir: ../.git/modules/sub3" >./sub3/.git_expect &&
> > -   echo "gitdir: ../../../.git/modules/sub3/modules/dirdir/subsub" 
> > >./sub3/dirdir/subsub/.git_expect
> > +   echo "gitdir: 
> > ../../../.git/modules/sub3/modules/dirdir%2fsubsub" 
> > >./sub3/dirdir/subsub/.git_expect
> 
> One interesting thing about url-encoding is that it's not one-to-one.
> This case could also be %2F, which is a different file (on a
> case-sensitive filesystem). I think "%20" and "+" are similarly
> interchangeable.
> 
> If we were decoding the filenames, that's fine. The round-trip is
> lossless.
> 
> But that's not quite how the new code behaves. We encode the input and
> then check to see if it matches an encoding we previously performed. So
> if our urlencode routines ever change, this will subtly break.
> 
> I don't know how much it's worth caring about. We're not that likely to
> change the routines ourself (though certainly a third-party
> implementation would need to know our exact url-encoding decisions).

This is exactly the reason why I wanted to get some opinions on what the
best thing to do here would be.  I _think_ the best thing would probably
be to write a specific routine to do the conversion, and it wouldn't
even have to be all that complex.  Basically I'm just interested in
converting '/' characters so that things no 

Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-09 Thread Jeff King
On Wed, Aug 08, 2018 at 03:33:23PM -0700, Brandon Williams wrote:

> Commit 0383bbb901 (submodule-config: verify submodule names as paths,
> 2018-04-30) introduced some checks to ensure that submodule names don't
> include directory traversal components (e.g. "../").
> 
> This addresses the vulnerability identified in 0383bbb901 but the root
> cause is that we use submodule names to construct paths to the
> submodule's git directory.  What we really should do is munge the
> submodule name before using it to construct a path.
> 
> Teach "submodule_name_to_gitdir()" to munge a submodule's name (by url
> encoding it) before using it to build a path to the submodule's gitdir.

I like this approach very much, and I think using url encoding is much
better than an opaque hash (purely because it makes debugging and
inspection saner).

Two thoughts, though:

> + modules_len = buf->len;
>   strbuf_addstr(buf, submodule_name);
> +
> + /*
> +  * If the submodule gitdir already exists using the old-fashioned
> +  * location (which uses the submodule name as-is, without munging it)
> +  * then return that.
> +  */
> + if (!access(buf->buf, F_OK))
> + return;

I think this backwards-compatibility is necessary to avoid pain. But
until it goes away, I don't think this is helping the vulnerability from
0383bbb901. Because there the issue was that the submodule name pointed
back into the working tree, so this access() would find the untrusted
working tree code and say "ah, an old-fashioned name!".

In theory a fresh clone could set a config option for "I only speak
use new-style modules". And there could even be a conversion program
that moves the modules as appropriate, fixes up the .git files in the
working tree, and then sets that config.

In fact, I think that config option _could_ be done by bumping
core.repositoryformatversion and then setting extensions.submodulenames
to "url" or something. Then you could never run into the confusing case
where you have a clone done by a new version of git (using new-style
names), but using an old-style version gets confused because it can't
find the module directories (instead, it would barf and say "I don't
know about that extension").

I don't know if any of that is worth it, though. We already fixed the
problem from 0383bbb901. There may be a _different_ "break out of the
modules directory" vulnerability, but since we disallow ".." it's hard
to see what it would be (the best I could come up with is maybe pointing
one module into the interior of another module, but I think you'd have
to trouble overwriting anything useful).

And while an old-style version of Git being confused might be annoying,
I suspect that bumping the repository version would be even _more_
annoying, because it would hit every command, not just ones that try to
touch those submodules.

> diff --git a/t/t7400-submodule-basic.sh b/t/t7400-submodule-basic.sh
> index 2c2c97e144..963693332c 100755
> --- a/t/t7400-submodule-basic.sh
> +++ b/t/t7400-submodule-basic.sh
> @@ -933,7 +933,7 @@ test_expect_success 'recursive relative submodules stay 
> relative' '
>   cd clone2 &&
>   git submodule update --init --recursive &&
>   echo "gitdir: ../.git/modules/sub3" >./sub3/.git_expect &&
> - echo "gitdir: ../../../.git/modules/sub3/modules/dirdir/subsub" 
> >./sub3/dirdir/subsub/.git_expect
> + echo "gitdir: 
> ../../../.git/modules/sub3/modules/dirdir%2fsubsub" 
> >./sub3/dirdir/subsub/.git_expect

One interesting thing about url-encoding is that it's not one-to-one.
This case could also be %2F, which is a different file (on a
case-sensitive filesystem). I think "%20" and "+" are similarly
interchangeable.

If we were decoding the filenames, that's fine. The round-trip is
lossless.

But that's not quite how the new code behaves. We encode the input and
then check to see if it matches an encoding we previously performed. So
if our urlencode routines ever change, this will subtly break.

I don't know how much it's worth caring about. We're not that likely to
change the routines ourself (though certainly a third-party
implementation would need to know our exact url-encoding decisions).

Some possible actions:

 0. Do nothing, and cross our fingers. ;)

 1. Don't use strbuf_addstr_urlencode(), but rather our own munging
function which we know will remain stable (or alternatively, a flag
to strbuf_addstr_urlencode to get the consistent behavior).

 2. Make sure we have tests which cover this, so at least somebody
changing the urlencode decisions will see a breakage. Your test here
covers the upper/lowercase one, but we might want one that covers
"+". (There may be more ambiguous cases, but those are the ones I
know about).

 3. Rather than check for the existence of names, decode what's actually
in the modules/ directory to create an in-memory index of names.

I hesitate to suggest that,