Re: Detect invalid submodule names from script?

2017-07-17 Thread Joachim Durchholz

Am 17.07.2017 um 19:49 schrieb Stefan Beller:

On Mon, Jul 17, 2017 at 4:17 AM, Joachim Durchholz <j...@durchholz.org> wrote:

Hi all

I'm hacking some script that calls into git, and I need to detect whether a
repository was configured with a submodule name that will work on "git
submodule init" and friends.


There is no such a thing as "git submodule valid-name" unfortunately.
Looking through "git submodule add", I think it is safe to assume
that any string valid as a subsection in git-config is a valid submodule name.


That's what I have been thinking myself.


Our man page says:

 Subsection names are case sensitive and can contain any characters
 except newline (doublequote " and backslash can be included by escaping
 them as \" and \\, respectively).

I am not sure about the quality of submodule shell code to handle the quotations
for double quote and backslash correctly, so I would suggest not using them,
either.


The quality is pretty dubious for path names (where I have torture tests 
in place).
I haven't done these tests for module names yet. I doubt it's going to 
look prettier though.



I *can* run a git init and see whether it works, but I need to be 100% sure
that the error was due to an invalid submodule name and not something else.
Bonus points for every version of git for which it works.


I do not think Git offers a universal solution across versions except actually
running "submodule init" and then making an educated guess if the error
comes from bad naming or something else.


Yeah, I think that's the way to go - clone into a local temp directory, 
set up the submodule to a sane module and subdirectory name, and if that 
works, test what happens with the original configuration.


I'd have preferred something less elaborate though.
E.g. SVN has a --xml option that will output everything in XML format. 
(Not useful for shell scripting but easy for almost any other scripting 
language.)



This sounds like you're taking user input or otherwise untrustworthy data
as submodule names?


I'm auto-downloading submodules as configured in 3rd-party repositories.
Which means untrusted data is copied to the user's home directory, so I 
definitely need to make sure that nothing nasty can happen.


My current code is a /bin/sh script.
Given the security/reliability/robustness problems I keep encountering, 
the temptation to rewrite this in Python has been steadily growing, 
though I'm not quite there yet.


Regards,
Jo


Detect invalid submodule names from script?

2017-07-17 Thread Joachim Durchholz

Hi all

I'm hacking some script that calls into git, and I need to detect 
whether a repository was configured with a submodule name that will work 
on "git submodule init" and friends.
I *can* run a git init and see whether it works, but I need to be 100% 
sure that the error was due to an invalid submodule name and not 
something else. Bonus points for every version of git for which it works.


Any suggestions?
Thanks!

Regards,
Jo


Re: Mirroring for offline use - best practices?

2017-07-12 Thread Joachim Durchholz

Am 12.07.2017 um 19:40 schrieb Stefan Beller:

Thanks for the feedback - it's been very, very useful to me!

> Yes, a local path implies --local in git-clone, which (a) uses hardlinks
> and (b) avoids some other protocol overhead.

I guess (a) is the most important one for repositories large enough to 
make this kind of stuff matter.
I had gathered so much, but I wasn't sure - my own repos aren't large 
enough to take any measurements.


>> Ramification 1:
>>
>> I'm not sure how best to prepare patches for push-to-upstream.
>> Is there value in collecting them locally into a push-to-upstream 
repo, or

>> is it better to just push from each local clone individually?
>
> It depends on a lot of things:
> * How critical is the latency in the desired workflow?
>
>Say you have this setup on a cruise ship and only push once when
>you are in a harbor, then (a) you want to make sure you pushed 
everything

>and (b) you care less about latency. Hence you would prefer to collect
>everything in one repo so nothing gets lost.

Yeah, that's the kind of scenario I'm having.
Less cruise ship (I wish!) and more on-train work during commutes, but 
it's similar.


But I think the "make sure nothing gets lost" aspect is the most 
important one. It's so easy to forget to push some sideline branch, 
particularly if you're in a workflow that uses many branches.


So... an "outbox" repository would be the Right Thing for me.
Not sure how generally applicable this is - what do other people think?

>Say you are in a fast paced environment, where you want instant 
feedback
>on your patches as they are mostly exploratory designs. Then you 
want to
>push directly from the local clone individually to minimize 
latency, I would

>imagine.

That's for online work I think.

Of course, there's the situation where you're sometimes offline and 
sometimes online.
I'm not sure how to best handle that - have a script that switches 
configurations? Just stick with that outbox workflow because switching 
workflows would invite error?


One thing that's specific to me is that I tend to be active in multiple 
projects, so I might get home and have queued up pushes in multiple repos.

I'm not sure how to best make sure that I don't forget a push.
OTOH maybe I'm overly optimistic about for how many repositories I might 
be working on a day, and it's a non-issue.


> * Does a local clone have any value for having the work from
>another local clone available? In that case you may want to
>have all your changes accumulated into the mirror.

Yeah, definitely.
A repository I work on might be used as a submodule, and I might want to 
check the ramifications.


> When a submodule gets deleted (git rm  && git commit),
> then all entries for that submodule in the .gitmodules file are also 
removed.

> That seems ok, but in an ideal world we may have a tombstone in there
> (e.g. the submodule.NAME.path still set) that would help for tasks 
like finding

> all submodules in the future.

I wouldn't want to use tombstone entries actually, because the content 
of .gitconfig and .gitmodules might have been modified for any number of 
reasons.


The incantations that I'm using for my own "gitmirror" script are:
1. Get all commits that touch .gitmodules via
  git rev-list --all --full-history -- .gitmodules
2. Get a list of all module names mentioned in that .gitmodules version via
  git config \
  --blob "${commit}:.gitmodules" \
  --name-only \
  --get-regexp "^submodule\..*\.path$"
3. Given the module name, extract path and url via
  git config \
--blob "${commit}:.gitmodules" \
--get "submodule.${module_name}.path"
resp.
--get "submodule.${module_name}.url" \

It's not the most efficient way conceivable, but it's using git's 
configuration parser, and it won't get put off by manual edits in 
configuration files .


It's nothing you'd do on the command line, hence the scripting :-)

>> I'm seeing the --recurse-submodules option for git fetch, so this 
might (or

>> might not) be the Right Thing.
>
> That only works for currently initialized (active) submodules. The 
submodules

> of the past and those which you do not have, are not fetched.

Aww.
Ah well.

> Without the submodule ramifications, I would have advised to have
> have the local mirror a 'bare' repo.
I'm currently steering towards having a cache for all repositories I 
ever downloaded, which would live in ~/.cache/gitmirror.


I can turn this script into a public, maintained project if there's 
interest.
Current state is two files, the script itself and a unit test script. 
It's still in a state of flux so sharing would be for the code reviews 
at this time, general usefulness would come later.


Regards,
Jo


Mirroring for offline use - best practices?

2017-07-12 Thread Joachim Durchholz

Hi all,

I'm pretty sure this is a FAQ, but articles I found on the Internet were 
either mere "recipes" (i.e. tell you how, but don't explain why), or 
bogged down in so many details that I was never sure how to proceed from 
there.



Basic situation:

There's a master repository (Github or corporate or whatever), and I 
want to set up a local mirror so that I can create clones without having 
to access the original upstream.
I'd like to set the mirror up so that creating a clone from it will 
automatically set up things to "just work": I.e. branches will track the 
mirror, not upstream, possibly other settings that I'm not aware of.


I gather that local clones are fast because hardlinked - is that correct?
Is that correct on Windows? (I can't easily avoid Windows.)


Ramification 1:

I'm not sure how best to prepare patches for push-to-upstream.
Is there value in collecting them locally into a push-to-upstream repo, 
or is it better to just push from each local clone individually?



Ramification 2:

Some of the repos I work with use submodules. Sometimes they use 
submodules that I'm not aware of. Or a submodule was used historically, 
and git bisect breaks/misbehaves because it can't get the submodule in 
offline mode.
Is there a way to get these, without writing a script that recurses 
through all versions of .gitmodules?
I'm seeing the --recurse-submodules option for git fetch, so this might 
(or might not) be the Right Thing.



Any thoughts welcome, thanks!

Regards,
Jo


Re: [PATCH 2/2] test-lib: exhaustively insert non-alnum ASCII into the TRASH_DIRECTORY name

2017-04-11 Thread Joachim Durchholz

Am 11.04.2017 um 01:23 schrieb Ævar Arnfjörð Bjarmason:
> * Much of it fails due to GIT_CEILING_DIRECTORIES not working with
> dirs with ":" in the name.

Oh. That might hit me: I'm using URLs for parent directory names in a 
cache directory.


urlencode may or may not work:

> t0021-conversion.sh = 37(%), 58(:)

Even if it works, I'll have a cache directory with pretty unreadable 
names. I had hoped I could avoid this kind of ugliness.


Re: [PATCH 2/2] test-lib: exhaustively insert non-alnum ASCII into the TRASH_DIRECTORY name

2017-04-10 Thread Joachim Durchholz

Am 10.04.2017 um 18:57 schrieb Jeff King:

If there are security bugs where a malicious input can cause us
to do something bad, that's something to care about. But that's very
different than asking "do these tests run to completion with a funny
input".


If the tests do not complete, git is doing something unexpected.


I very much disagree with that. Git's test operate under a set of
assumptions, and if you violate those assumptions, then the failures are
not meaningful.


In that case the tests do not validate that git can properly work with 
special characters.

That's a pretty big coverage gap.


Take alternates, for instance. The on-disk storage format cannot
represent paths with newlines in them. If a test does:

  git clone -s "$(pwd)" parent.git child &&
  test -d child

then that test is going to fail if the test directory has a newline in
it. But that doesn't tell us anything meaningful. Maybe there is a bug
and maybe there isn't, but we cannot know because the thing being tested
cannot possibly work under the test environment given.


Sure. Not all tests are meaningful in all environments.
That doesn't mean that the tests are generally meaningless.

Also, we're talking about two pretty different things here: gits 
interaction with the file system, and git's interaction with whatever 
shell its scripts are using.
In an ideal world, these two aspects would be orthogonal and could be 
tested independently of each other. Since in practice we do have 
correlations and dependencies, this isn't always possible, but it's what 
we should aim for.



You can rewrite all the tests to say "skip this test if there's a
newline in the test directory". But to what end? It's work to write and
to maintain, and we haven't gained new information.


Not on that "alternates" thing (whatever that is), but we have a test 
that will work and provide information on systems that do allow newlines.



That in itself is not a security hole, but there's a pretty good chance that
at least one of these ~230 unexpected things can be turned into one, given
enough time and motivation. The risk multiplies as this is shell scripting,
where the path from "string is misinterpreted" to "string is run as a
command" is considerably shorter than in other languages.


Sure, and I'd encourage people who are interested to dig through the
results and see if they can find a real problem. I looked at several and
didn't find anything that wasn't an example of the "test assumptions"
thing above.


Don't assume that there's no risk just because you didn't find anything.

Also, git might not be the actual hole, but other software that relies 
on git not doing anything awkward might fail that assumption and expose 
the actual breach of security.
You could argue that it's not a problem in git, but it is. Unless and 
until the git docs clearly state that things may break if "funny 
characters" are being used, and where.



But again, I'm happy to be proven wrong.


With security, you need to be confident about the absence of /any/ type 
of hole, not the absence of a /specific/ type of hole such as a shell 
injection.
So the way forward isn't proving you wrong by providing a specific 
exploit, it's making sure that no exploit with URLs and file names can 
possibly exist.


Now if I'm reading things like "heuristics" and "git uses URL-specific 
code even after it has determined it's not a URL"... well, that's the 
exact opposite of reassuring messages.


> I just don't think

plastering control characters into the test directory names all the time
is a good way of finding those problems (and doesn't balance out the
cost).


Fair enough.


Re: [PATCH 2/2] test-lib: exhaustively insert non-alnum ASCII into the TRASH_DIRECTORY name

2017-04-10 Thread Joachim Durchholz

Am 10.04.2017 um 15:38 schrieb Jeff King:

Are those bugs? Maybe. Certainly they are limitations. But are they ones
anybody _cares_ about?  I think this may fall under "if it hurts, don't
do it".


It's not always possible to avoid that.

URLs, for example, may contain "funny characters", including multi-byte 
characters of which the second byte is 0x0a. If they are guaranteed to 
always be URL-encoded this isn't a problem, but then we still need to 
make sure that URL-encoding does happen.


Next source of funny characters that comes to my mind is submodules. 
They derive their name from the URL by default, and the subdirectory 
name as well. Again, consider the multibyte name where the second 
character is 0x0a. Or 0x80: À (uppercase A with accent grave) happens to 
have that byte in UTF-8 encoding, Ẁ is U+1E80 which would be encoded as 
0x80 0x1e on an NTFS filesystem (barring additional coding steps in APIs 
or webservices, which further complicate the situation but don't usually 
eliminate the problem, they just shift it around).


> If there are security bugs where a malicious input can cause us

to do something bad, that's something to care about. But that's very
different than asking "do these tests run to completion with a funny
input".


If the tests do not complete, git is doing something unexpected.
That in itself is not a security hole, but there's a pretty good chance 
that at least one of these ~230 unexpected things can be turned into 
one, given enough time and motivation. The risk multiplies as this is 
shell scripting, where the path from "string is misinterpreted" to 
"string is run as a command" is considerably shorter than in other 
languages.


Re: [PATCH 0/2] test: Detect *lots* of bugs by adding non-alnum to trash dir names

2017-04-09 Thread Joachim Durchholz

Am 09.04.2017 um 21:11 schrieb Ævar Arnfjörð Bjarmason:

This series changes the test library so that we interpolate the result
of:

perl -e 'print join q[], grep { /[^[:alnum:]]/ and !m<[./]> } map chr, 
0x01..0x7f'

Into the trash directory name we generate. Doing this makes 30% of the
test suite fail.


Wow.
I know that I tend to bring out weird errors in systems, but this is 
much deeper than I could have ever expected.


You might want to additionally test directory names like "asdf" and 
'asdf' - some shell might try to be smart about such names and 
automatically unquote them. This would probably mean running the test 
suite multiple times, which some people find offensive (myself included).


There's also high-bit characters that might cause oddness. 0x80 and 0xff 
look like interesting edge cases to me.
Note that just appending these bytes is going to cause failures with 
filesystems that expect UTF-8 file names, you'd want to use a Unicode 
code point that gets encoded using a 0x80 resp. 0xff byte (I suspect 
0xff is actually not a valid UTF-8 encoding at all but I'd have to 
double-check that).


Is git supposed to run fine under Windows, or under Linux with an ntfs 
filesystem driver?
If yes, then I'd expect this patch to cause the other 70% of tests to 
crash, because NTFS forbids several non-alnum printable characters, 
among them <=>\


Re: [PATCH] submodule: prevent backslash expantion in submodule names

2017-04-08 Thread Joachim Durchholz

Am 08.04.2017 um 12:59 schrieb Jeff King:

The reason I mentioned escaping earlier is I wondered what would happen
when the submodule starts with a double-quote, or has a newline in the
name.


I have tested newlines within the name, these work fine.

I also tested double and single quotes within the name, but not at 
beginning or end.



So I think your patch is fine there. But it does raise a few concerns.
It looks like git-submodule does not cope well with exotic filenames:

  $ git submodule add /some/repo "$(printf 'sub with\nnewline')"
  Cloning into '/home/peff/tmp/sub with
  newline'...
  done.
  error: invalid key (newline): submodule.sub with
  newline.url
  error: invalid key (newline): submodule.sub with
  newline.path
  Failed to register submodule 'sub with
  newline'


Strange. I'm running essentially the same kind of request, and things 
work fine.
Might be due to me using Python3 instead of bash, or maybe due to 
different versions of git.


If anybody is interested, I can publish my test code on github, it was 
scheduled to land there anyway.



I'm not too worried about that.  It's a nonsense request, and our config
format has no syntactic mechanism to represent that key.


Oh. I've been thinking that the quoted format is exactly for that kind 
of stuff.
Though it might be prone to eol conversion if a submodule name contains 
crlf sequences.


Also, funny behavour. Experience has taught me that funny behaviour, if 
it isn't exploitable today, may combine with some new funny behaviour in 
a future version of the same software. So I'm worried even with that.


This is starting to look like a can of worms to me... one way to "close 
the lid" would be if git

* defined what's a valid submodule name,
* rejected invalid submodule names, and
* documented validity rules in the git-submodule docs.
YMMV, just my 2 cents :-)

Regards,
Jo


Re: [PATCH] submodule: prevent backslash expantion in submodule names

2017-04-07 Thread Joachim Durchholz

Thanks!


Re: problem with backslash in directory name

2017-04-07 Thread Joachim Durchholz

Am 07.04.2017 um 08:30 schrieb Jeff King:

I also don't know how some of those loops would cope with
a submodule name that needed quoting).


"git submodule add" worked fine with most of the following names:

"sub"
# potentially confusing the shell
"sub with blanks",
"sub with\nnewline",
"sub with'single quote",
"sub with\"double quote",
"sub with\\backslash",
"sub with\bbackspace",
"sub with\thorizontal tab",
# potentially confusing git's configuration format
"sub with #",
"sub with ="

(That's Python 3 literals in case somebody is wondering. I'm using 
Python to unit test a shell script, just so I can catch this sort of 
stuff...)


Re: problem with backslash in directory name

2017-04-07 Thread Joachim Durchholz

Am 07.04.2017 um 08:30 schrieb Jeff King:

Probably it's "read" which does backslash expansion, but nothing else.
Just grepping git-submodule.sh, some of the "read" calls should probably
be "read -r"


http://wiki.bash-hackers.org/commands/builtin/read has this to say:


Essentially all you need to know about -r is to ALWAYS use it. The

> exact behavior you get without -r is completely useless even for weird
> purposes. It basically allows the escaping of input which matches
> something in IFS, and also escapes line continuations. It's explained
> pretty well in the POSIX read[1] spec.

[1] 
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/read.html#tag_20_109


(That's the kind of stuff that makes me shy of using shell scripts - 
always yet another surprise in waiting...)


problem with backslash in directory name

2017-04-07 Thread Joachim Durchholz

Hi all,

I'm having a problem with submodules that reside in directories that 
(unwisely) contain a backslash in their name.



Transcript:

### Arrange

$ git init main
Initialized empty Git repository in /tmp/test/main/.git/

$ git init sub\\with\\backslash
Initialized empty Git repository in /tmp/test/sub\with\backslash/.git/
# This looks okay: the shell interpreted \\ as \,
# so we get sub\with\backslash

# Create a log entry in sub\with\backslash
# (it can't be added as a submodule otherwise)
# ((actually I think it's a misfeature, my current use case would be
# easier if git didn't insist on having a log in submodules))
$ touch sub\\with\\backslash/empty.file
$ git -C sub\\with\\backslash add empty.file
$ git -C sub\\with\\backslash commit -m "Added empty.file"
[master (root-commit) a27a485] Added empty.file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 empty.file

### Act/Assert

$ git -C main submodule add ../sub\\with\\backslash
fatal: repository '/tmp/test/sub\witackslash' does not exist
fatal: clone of '/tmp/test/sub\witackslash' into submodule path 
'sub\with\backslash' failed

# The first "fatal:" line talks about "witackslash"
# Um... "ackslash"? Now that's a nice nickname for a CVE :-)

# Okay, let's see what's actually in that message
$ git -C main submodule add ../sub\\with\\backslash 2>&1 | xxd
: 6661 7461 6c3a 2072 6570 6f73 6974 6f72  fatal: repositor
0010: 7920 272f 746d 702f 7465 7374 2f73 7562  y '/tmp/test/sub
0020: 5c77 6974 6808 6163 6b73 6c61 7368 2720  \with.ackslash'
0030: 646f 6573 206e 6f74 2065 7869 7374 0a66  does not exist.f
0040: 6174 616c 3a20 636c 6f6e 6520 6f66 2027  atal: clone of '
0050: 2f74 6d70 2f74 6573 742f 7375 625c 7769  /tmp/test/sub\wi
0060: 7468 0861 636b 736c 6173 6827 2069 6e74  th.ackslash' int
0070: 6f20 7375 626d 6f64 756c 6520 7061 7468  o submodule path
0080: 2027 7375 625c 7769 7468 5c62 6163 6b73   'sub\with\backs
0090: 6c61 7368 2720 6661 696c 6564 0a lash' failed.

# Yeah, there's a 0x08 at offset 0x25.
# It's pretty strange that it is eliding the w following the \b,
# not the h preceding it.


So... something inside "git submodule add" is replacing the \b with a 
backspace control code.



Next I tried something nasty:

$ mv sub\\with\\backslash 'sub: $(bc)'
git -C main submodule add '../sub: $(bc)'
Cloning into ' $(bc)'...
done.

Whatever that "something" is, it is not doing shell expansion, otherwise 
it would have started an interactive calculator session.

Phew :-)
I'm still a bit uneasy because I don't know what other escape sequences 
might get interpreted, and what their effects are.