Re: Detect invalid submodule names from script?
Am 17.07.2017 um 19:49 schrieb Stefan Beller: On Mon, Jul 17, 2017 at 4:17 AM, Joachim Durchholz <j...@durchholz.org> wrote: Hi all I'm hacking some script that calls into git, and I need to detect whether a repository was configured with a submodule name that will work on "git submodule init" and friends. There is no such a thing as "git submodule valid-name" unfortunately. Looking through "git submodule add", I think it is safe to assume that any string valid as a subsection in git-config is a valid submodule name. That's what I have been thinking myself. Our man page says: Subsection names are case sensitive and can contain any characters except newline (doublequote " and backslash can be included by escaping them as \" and \\, respectively). I am not sure about the quality of submodule shell code to handle the quotations for double quote and backslash correctly, so I would suggest not using them, either. The quality is pretty dubious for path names (where I have torture tests in place). I haven't done these tests for module names yet. I doubt it's going to look prettier though. I *can* run a git init and see whether it works, but I need to be 100% sure that the error was due to an invalid submodule name and not something else. Bonus points for every version of git for which it works. I do not think Git offers a universal solution across versions except actually running "submodule init" and then making an educated guess if the error comes from bad naming or something else. Yeah, I think that's the way to go - clone into a local temp directory, set up the submodule to a sane module and subdirectory name, and if that works, test what happens with the original configuration. I'd have preferred something less elaborate though. E.g. SVN has a --xml option that will output everything in XML format. (Not useful for shell scripting but easy for almost any other scripting language.) This sounds like you're taking user input or otherwise untrustworthy data as submodule names? I'm auto-downloading submodules as configured in 3rd-party repositories. Which means untrusted data is copied to the user's home directory, so I definitely need to make sure that nothing nasty can happen. My current code is a /bin/sh script. Given the security/reliability/robustness problems I keep encountering, the temptation to rewrite this in Python has been steadily growing, though I'm not quite there yet. Regards, Jo
Detect invalid submodule names from script?
Hi all I'm hacking some script that calls into git, and I need to detect whether a repository was configured with a submodule name that will work on "git submodule init" and friends. I *can* run a git init and see whether it works, but I need to be 100% sure that the error was due to an invalid submodule name and not something else. Bonus points for every version of git for which it works. Any suggestions? Thanks! Regards, Jo
Re: Mirroring for offline use - best practices?
Am 12.07.2017 um 19:40 schrieb Stefan Beller: Thanks for the feedback - it's been very, very useful to me! > Yes, a local path implies --local in git-clone, which (a) uses hardlinks > and (b) avoids some other protocol overhead. I guess (a) is the most important one for repositories large enough to make this kind of stuff matter. I had gathered so much, but I wasn't sure - my own repos aren't large enough to take any measurements. >> Ramification 1: >> >> I'm not sure how best to prepare patches for push-to-upstream. >> Is there value in collecting them locally into a push-to-upstream repo, or >> is it better to just push from each local clone individually? > > It depends on a lot of things: > * How critical is the latency in the desired workflow? > >Say you have this setup on a cruise ship and only push once when >you are in a harbor, then (a) you want to make sure you pushed everything >and (b) you care less about latency. Hence you would prefer to collect >everything in one repo so nothing gets lost. Yeah, that's the kind of scenario I'm having. Less cruise ship (I wish!) and more on-train work during commutes, but it's similar. But I think the "make sure nothing gets lost" aspect is the most important one. It's so easy to forget to push some sideline branch, particularly if you're in a workflow that uses many branches. So... an "outbox" repository would be the Right Thing for me. Not sure how generally applicable this is - what do other people think? >Say you are in a fast paced environment, where you want instant feedback >on your patches as they are mostly exploratory designs. Then you want to >push directly from the local clone individually to minimize latency, I would >imagine. That's for online work I think. Of course, there's the situation where you're sometimes offline and sometimes online. I'm not sure how to best handle that - have a script that switches configurations? Just stick with that outbox workflow because switching workflows would invite error? One thing that's specific to me is that I tend to be active in multiple projects, so I might get home and have queued up pushes in multiple repos. I'm not sure how to best make sure that I don't forget a push. OTOH maybe I'm overly optimistic about for how many repositories I might be working on a day, and it's a non-issue. > * Does a local clone have any value for having the work from >another local clone available? In that case you may want to >have all your changes accumulated into the mirror. Yeah, definitely. A repository I work on might be used as a submodule, and I might want to check the ramifications. > When a submodule gets deleted (git rm && git commit), > then all entries for that submodule in the .gitmodules file are also removed. > That seems ok, but in an ideal world we may have a tombstone in there > (e.g. the submodule.NAME.path still set) that would help for tasks like finding > all submodules in the future. I wouldn't want to use tombstone entries actually, because the content of .gitconfig and .gitmodules might have been modified for any number of reasons. The incantations that I'm using for my own "gitmirror" script are: 1. Get all commits that touch .gitmodules via git rev-list --all --full-history -- .gitmodules 2. Get a list of all module names mentioned in that .gitmodules version via git config \ --blob "${commit}:.gitmodules" \ --name-only \ --get-regexp "^submodule\..*\.path$" 3. Given the module name, extract path and url via git config \ --blob "${commit}:.gitmodules" \ --get "submodule.${module_name}.path" resp. --get "submodule.${module_name}.url" \ It's not the most efficient way conceivable, but it's using git's configuration parser, and it won't get put off by manual edits in configuration files . It's nothing you'd do on the command line, hence the scripting :-) >> I'm seeing the --recurse-submodules option for git fetch, so this might (or >> might not) be the Right Thing. > > That only works for currently initialized (active) submodules. The submodules > of the past and those which you do not have, are not fetched. Aww. Ah well. > Without the submodule ramifications, I would have advised to have > have the local mirror a 'bare' repo. I'm currently steering towards having a cache for all repositories I ever downloaded, which would live in ~/.cache/gitmirror. I can turn this script into a public, maintained project if there's interest. Current state is two files, the script itself and a unit test script. It's still in a state of flux so sharing would be for the code reviews at this time, general usefulness would come later. Regards, Jo
Mirroring for offline use - best practices?
Hi all, I'm pretty sure this is a FAQ, but articles I found on the Internet were either mere "recipes" (i.e. tell you how, but don't explain why), or bogged down in so many details that I was never sure how to proceed from there. Basic situation: There's a master repository (Github or corporate or whatever), and I want to set up a local mirror so that I can create clones without having to access the original upstream. I'd like to set the mirror up so that creating a clone from it will automatically set up things to "just work": I.e. branches will track the mirror, not upstream, possibly other settings that I'm not aware of. I gather that local clones are fast because hardlinked - is that correct? Is that correct on Windows? (I can't easily avoid Windows.) Ramification 1: I'm not sure how best to prepare patches for push-to-upstream. Is there value in collecting them locally into a push-to-upstream repo, or is it better to just push from each local clone individually? Ramification 2: Some of the repos I work with use submodules. Sometimes they use submodules that I'm not aware of. Or a submodule was used historically, and git bisect breaks/misbehaves because it can't get the submodule in offline mode. Is there a way to get these, without writing a script that recurses through all versions of .gitmodules? I'm seeing the --recurse-submodules option for git fetch, so this might (or might not) be the Right Thing. Any thoughts welcome, thanks! Regards, Jo
Re: [PATCH 2/2] test-lib: exhaustively insert non-alnum ASCII into the TRASH_DIRECTORY name
Am 11.04.2017 um 01:23 schrieb Ævar Arnfjörð Bjarmason: > * Much of it fails due to GIT_CEILING_DIRECTORIES not working with > dirs with ":" in the name. Oh. That might hit me: I'm using URLs for parent directory names in a cache directory. urlencode may or may not work: > t0021-conversion.sh = 37(%), 58(:) Even if it works, I'll have a cache directory with pretty unreadable names. I had hoped I could avoid this kind of ugliness.
Re: [PATCH 2/2] test-lib: exhaustively insert non-alnum ASCII into the TRASH_DIRECTORY name
Am 10.04.2017 um 18:57 schrieb Jeff King: If there are security bugs where a malicious input can cause us to do something bad, that's something to care about. But that's very different than asking "do these tests run to completion with a funny input". If the tests do not complete, git is doing something unexpected. I very much disagree with that. Git's test operate under a set of assumptions, and if you violate those assumptions, then the failures are not meaningful. In that case the tests do not validate that git can properly work with special characters. That's a pretty big coverage gap. Take alternates, for instance. The on-disk storage format cannot represent paths with newlines in them. If a test does: git clone -s "$(pwd)" parent.git child && test -d child then that test is going to fail if the test directory has a newline in it. But that doesn't tell us anything meaningful. Maybe there is a bug and maybe there isn't, but we cannot know because the thing being tested cannot possibly work under the test environment given. Sure. Not all tests are meaningful in all environments. That doesn't mean that the tests are generally meaningless. Also, we're talking about two pretty different things here: gits interaction with the file system, and git's interaction with whatever shell its scripts are using. In an ideal world, these two aspects would be orthogonal and could be tested independently of each other. Since in practice we do have correlations and dependencies, this isn't always possible, but it's what we should aim for. You can rewrite all the tests to say "skip this test if there's a newline in the test directory". But to what end? It's work to write and to maintain, and we haven't gained new information. Not on that "alternates" thing (whatever that is), but we have a test that will work and provide information on systems that do allow newlines. That in itself is not a security hole, but there's a pretty good chance that at least one of these ~230 unexpected things can be turned into one, given enough time and motivation. The risk multiplies as this is shell scripting, where the path from "string is misinterpreted" to "string is run as a command" is considerably shorter than in other languages. Sure, and I'd encourage people who are interested to dig through the results and see if they can find a real problem. I looked at several and didn't find anything that wasn't an example of the "test assumptions" thing above. Don't assume that there's no risk just because you didn't find anything. Also, git might not be the actual hole, but other software that relies on git not doing anything awkward might fail that assumption and expose the actual breach of security. You could argue that it's not a problem in git, but it is. Unless and until the git docs clearly state that things may break if "funny characters" are being used, and where. But again, I'm happy to be proven wrong. With security, you need to be confident about the absence of /any/ type of hole, not the absence of a /specific/ type of hole such as a shell injection. So the way forward isn't proving you wrong by providing a specific exploit, it's making sure that no exploit with URLs and file names can possibly exist. Now if I'm reading things like "heuristics" and "git uses URL-specific code even after it has determined it's not a URL"... well, that's the exact opposite of reassuring messages. > I just don't think plastering control characters into the test directory names all the time is a good way of finding those problems (and doesn't balance out the cost). Fair enough.
Re: [PATCH 2/2] test-lib: exhaustively insert non-alnum ASCII into the TRASH_DIRECTORY name
Am 10.04.2017 um 15:38 schrieb Jeff King: Are those bugs? Maybe. Certainly they are limitations. But are they ones anybody _cares_ about? I think this may fall under "if it hurts, don't do it". It's not always possible to avoid that. URLs, for example, may contain "funny characters", including multi-byte characters of which the second byte is 0x0a. If they are guaranteed to always be URL-encoded this isn't a problem, but then we still need to make sure that URL-encoding does happen. Next source of funny characters that comes to my mind is submodules. They derive their name from the URL by default, and the subdirectory name as well. Again, consider the multibyte name where the second character is 0x0a. Or 0x80: À (uppercase A with accent grave) happens to have that byte in UTF-8 encoding, Ẁ is U+1E80 which would be encoded as 0x80 0x1e on an NTFS filesystem (barring additional coding steps in APIs or webservices, which further complicate the situation but don't usually eliminate the problem, they just shift it around). > If there are security bugs where a malicious input can cause us to do something bad, that's something to care about. But that's very different than asking "do these tests run to completion with a funny input". If the tests do not complete, git is doing something unexpected. That in itself is not a security hole, but there's a pretty good chance that at least one of these ~230 unexpected things can be turned into one, given enough time and motivation. The risk multiplies as this is shell scripting, where the path from "string is misinterpreted" to "string is run as a command" is considerably shorter than in other languages.
Re: [PATCH 0/2] test: Detect *lots* of bugs by adding non-alnum to trash dir names
Am 09.04.2017 um 21:11 schrieb Ævar Arnfjörð Bjarmason: This series changes the test library so that we interpolate the result of: perl -e 'print join q[], grep { /[^[:alnum:]]/ and !m<[./]> } map chr, 0x01..0x7f' Into the trash directory name we generate. Doing this makes 30% of the test suite fail. Wow. I know that I tend to bring out weird errors in systems, but this is much deeper than I could have ever expected. You might want to additionally test directory names like "asdf" and 'asdf' - some shell might try to be smart about such names and automatically unquote them. This would probably mean running the test suite multiple times, which some people find offensive (myself included). There's also high-bit characters that might cause oddness. 0x80 and 0xff look like interesting edge cases to me. Note that just appending these bytes is going to cause failures with filesystems that expect UTF-8 file names, you'd want to use a Unicode code point that gets encoded using a 0x80 resp. 0xff byte (I suspect 0xff is actually not a valid UTF-8 encoding at all but I'd have to double-check that). Is git supposed to run fine under Windows, or under Linux with an ntfs filesystem driver? If yes, then I'd expect this patch to cause the other 70% of tests to crash, because NTFS forbids several non-alnum printable characters, among them <=>\
Re: [PATCH] submodule: prevent backslash expantion in submodule names
Am 08.04.2017 um 12:59 schrieb Jeff King: The reason I mentioned escaping earlier is I wondered what would happen when the submodule starts with a double-quote, or has a newline in the name. I have tested newlines within the name, these work fine. I also tested double and single quotes within the name, but not at beginning or end. So I think your patch is fine there. But it does raise a few concerns. It looks like git-submodule does not cope well with exotic filenames: $ git submodule add /some/repo "$(printf 'sub with\nnewline')" Cloning into '/home/peff/tmp/sub with newline'... done. error: invalid key (newline): submodule.sub with newline.url error: invalid key (newline): submodule.sub with newline.path Failed to register submodule 'sub with newline' Strange. I'm running essentially the same kind of request, and things work fine. Might be due to me using Python3 instead of bash, or maybe due to different versions of git. If anybody is interested, I can publish my test code on github, it was scheduled to land there anyway. I'm not too worried about that. It's a nonsense request, and our config format has no syntactic mechanism to represent that key. Oh. I've been thinking that the quoted format is exactly for that kind of stuff. Though it might be prone to eol conversion if a submodule name contains crlf sequences. Also, funny behavour. Experience has taught me that funny behaviour, if it isn't exploitable today, may combine with some new funny behaviour in a future version of the same software. So I'm worried even with that. This is starting to look like a can of worms to me... one way to "close the lid" would be if git * defined what's a valid submodule name, * rejected invalid submodule names, and * documented validity rules in the git-submodule docs. YMMV, just my 2 cents :-) Regards, Jo
Re: [PATCH] submodule: prevent backslash expantion in submodule names
Thanks!
Re: problem with backslash in directory name
Am 07.04.2017 um 08:30 schrieb Jeff King: I also don't know how some of those loops would cope with a submodule name that needed quoting). "git submodule add" worked fine with most of the following names: "sub" # potentially confusing the shell "sub with blanks", "sub with\nnewline", "sub with'single quote", "sub with\"double quote", "sub with\\backslash", "sub with\bbackspace", "sub with\thorizontal tab", # potentially confusing git's configuration format "sub with #", "sub with =" (That's Python 3 literals in case somebody is wondering. I'm using Python to unit test a shell script, just so I can catch this sort of stuff...)
Re: problem with backslash in directory name
Am 07.04.2017 um 08:30 schrieb Jeff King: Probably it's "read" which does backslash expansion, but nothing else. Just grepping git-submodule.sh, some of the "read" calls should probably be "read -r" http://wiki.bash-hackers.org/commands/builtin/read has this to say: Essentially all you need to know about -r is to ALWAYS use it. The > exact behavior you get without -r is completely useless even for weird > purposes. It basically allows the escaping of input which matches > something in IFS, and also escapes line continuations. It's explained > pretty well in the POSIX read[1] spec. [1] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/read.html#tag_20_109 (That's the kind of stuff that makes me shy of using shell scripts - always yet another surprise in waiting...)
problem with backslash in directory name
Hi all, I'm having a problem with submodules that reside in directories that (unwisely) contain a backslash in their name. Transcript: ### Arrange $ git init main Initialized empty Git repository in /tmp/test/main/.git/ $ git init sub\\with\\backslash Initialized empty Git repository in /tmp/test/sub\with\backslash/.git/ # This looks okay: the shell interpreted \\ as \, # so we get sub\with\backslash # Create a log entry in sub\with\backslash # (it can't be added as a submodule otherwise) # ((actually I think it's a misfeature, my current use case would be # easier if git didn't insist on having a log in submodules)) $ touch sub\\with\\backslash/empty.file $ git -C sub\\with\\backslash add empty.file $ git -C sub\\with\\backslash commit -m "Added empty.file" [master (root-commit) a27a485] Added empty.file 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 empty.file ### Act/Assert $ git -C main submodule add ../sub\\with\\backslash fatal: repository '/tmp/test/sub\witackslash' does not exist fatal: clone of '/tmp/test/sub\witackslash' into submodule path 'sub\with\backslash' failed # The first "fatal:" line talks about "witackslash" # Um... "ackslash"? Now that's a nice nickname for a CVE :-) # Okay, let's see what's actually in that message $ git -C main submodule add ../sub\\with\\backslash 2>&1 | xxd : 6661 7461 6c3a 2072 6570 6f73 6974 6f72 fatal: repositor 0010: 7920 272f 746d 702f 7465 7374 2f73 7562 y '/tmp/test/sub 0020: 5c77 6974 6808 6163 6b73 6c61 7368 2720 \with.ackslash' 0030: 646f 6573 206e 6f74 2065 7869 7374 0a66 does not exist.f 0040: 6174 616c 3a20 636c 6f6e 6520 6f66 2027 atal: clone of ' 0050: 2f74 6d70 2f74 6573 742f 7375 625c 7769 /tmp/test/sub\wi 0060: 7468 0861 636b 736c 6173 6827 2069 6e74 th.ackslash' int 0070: 6f20 7375 626d 6f64 756c 6520 7061 7468 o submodule path 0080: 2027 7375 625c 7769 7468 5c62 6163 6b73 'sub\with\backs 0090: 6c61 7368 2720 6661 696c 6564 0a lash' failed. # Yeah, there's a 0x08 at offset 0x25. # It's pretty strange that it is eliding the w following the \b, # not the h preceding it. So... something inside "git submodule add" is replacing the \b with a backspace control code. Next I tried something nasty: $ mv sub\\with\\backslash 'sub: $(bc)' git -C main submodule add '../sub: $(bc)' Cloning into ' $(bc)'... done. Whatever that "something" is, it is not doing shell expansion, otherwise it would have started an interactive calculator session. Phew :-) I'm still a bit uneasy because I don't know what other escape sequences might get interpreted, and what their effects are.