Re: [PATCH] specify encoding for sed command

2018-04-12 Thread Matthew Coleman
>> I think the best way to move forward with this is a new patch that uses
>> `awk` instead of `sed`: I tested several `awk` variants and the command
>> was portable without requiring any changes to LANG or LC_ALL.
>> 
>> Does that sound like a good plan?
> 
> No ;)
> Could you please give the patch below a try?
Your patch works great and is clearly a way better approach than parsing the 
output of `set`.

> I'm just not sure whether we should burden this otherwise nice and short
> commit message with the details of this Powerline-macOS-Bash-sed issue...
It might be worth mentioning it briefly, in order to reduce the chances of a 
regression in the future. Maybe something like...
"In Bash we can do better: run the 'compgen -v __gitcomp_builtin_' builtin 
command, which lists the same variables, but without the risk of errors caused 
by poor Unicode handling in some shells' 'set' builtin command or the overhead 
of executing 'sed'."


Re: [PATCH] specify encoding for sed command

2018-04-12 Thread SZEDER Gábor

> I did a little more digging into this issue today.

> The issue isn't actually caused by `sed`: it's caused by the way that
> the `set` builtin outputs characters in the Unicode Private Use Area
> (PUA) in the build of Bash 3.2.57 that macOS Sierra ships with.
> 
> Powerline uses several PUA code points to make some of its pretty text
> UI elements:
> 
> Code point (hex value): description
> U+E0A0 (0xEE82A0): Version control branch
> U+E0A1 (0xEE82A1): LN (line) symbol
> U+E0A2 (0xEE82A2): Closed padlock
> U+E0B0 (0xEE82B0): Rightwards black arrowhead
> U+E0B1 (0xEE82B1): Rightwards arrowhead
> U+E0B2 (0xEE82B2): Leftwards black arrowhead
> U+E0B3 (0xEE82B3): Leftwards arrowhead
> 
> macOS Bash 3.2.57's `set` builtin has garbled output where Powerline's
> special symbols should be in the PS1 variable, but Bash 4.4.19
> (installed on macOS via homebrew) and Bash 4.3.38 (Ubuntu 16.04) both
> display it correctly in the output of `set`. `echo $PS1` does display
> the symbols correctly on these versions of Bash.

Thanks for digging.

> I think the best way to move forward with this is a new patch that uses
> `awk` instead of `sed`: I tested several `awk` variants and the command
> was portable without requiring any changes to LANG or LC_ALL.
> 
> Does that sound like a good plan?

No ;)
Could you please give the patch below a try?

I intended it as a fix for a minor performance regression introduced in
the same commit that triggered this issue, but if it can work around
this issue, too, all the better.

I'm just not sure whether we should burden this otherwise nice and short
commit message with the details of this Powerline-macOS-Bash-sed issue...


  -- >8 --

Subject: [PATCH] completion: reduce overhead of clearing cached --options

To get the names of all '$__git_builtin_*' variables caching --options
of builtin commands in order to unset them, 8b0eaa41f2 (completion:
clear cached --options when sourcing the completion script,
2018-03-22) runs a 'set |sed s///' pipeline.  This works both in Bash
and in ZSH, but has a higher than necessasry overhead with the extra
processes.

In Bash we can do better: run the 'compgen -v __gitcomp_builtin_'
builtin command, which lists the same variables, but without a
pipeline and 'sed' it can do so with lower overhead.

Signed-off-by: SZEDER Gábor 

TODO: as an unexpected bonus, this might work around the issue with
'sed' and invalid UTF-8 characters in the environment as well, at least
in Bash.

  
https://public-inbox.org/git/0102016293c8dca7-6626fcde-548d-476e-b61f-c83ecdeedfe1-000...@eu-west-1.amazonses.com/
---
 contrib/completion/git-completion.bash | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/contrib/completion/git-completion.bash 
b/contrib/completion/git-completion.bash
index b09c8a2362..4ef59a51be 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -282,7 +282,11 @@ __gitcomp ()
 
 # Clear the variables caching builtins' options when (re-)sourcing
 # the completion script.
-unset $(set |sed -ne 
's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
+if [[ -n ${ZSH_VERSION-} ]]; then
+   unset $(set |sed -ne 
's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
+else
+   unset $(compgen -v __gitcomp_builtin_)
+fi
 
 # This function is equivalent to
 #
-- 
2.17.0.366.gbe216a3084



Re: [PATCH] specify encoding for sed command

2018-04-12 Thread Matthew Coleman
I did a little more digging into this issue today.

> On Apr 11, 2018, at 4:42 PM, Matt Coleman  wrote:
> 
> I found another (possibly better) way to fix this:
> 
>> On Apr 10, 2018, at 3:18 AM, Matt Coleman  wrote:
>> 
>>> 1) What platform OS / version / sed version is this on?
>> I'm experiencing this on macOS Sierra (10.12.6). The issue occurs with the 
>> OS's native sed, which is FreeBSD sed so the version number is kind of 
>> ambiguous.
>> 
>> The error goes away if I set LANG=C or LC_ALL=C or change it to use GNU sed 
>> (installed via homebrew as gsed).
> 
> If I change it to use awk instead of sed, it works with mawk, gawk, and macOS 
> awk:
> unset $(set | awk -F '=' '/^__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*=/ 
> {print $1}') 2>/dev/null
> 
> I compared sed vs. awk on Linux and Mac and they all take about the same 
> amount of time to run (within 0.002ms).

The issue isn't actually caused by `sed`: it's caused by the way that the `set` 
builtin outputs characters in the Unicode Private Use Area (PUA) in the build 
of Bash 3.2.57 that macOS Sierra ships with.

Powerline uses several PUA code points to make some of its pretty text UI 
elements:

Code point (hex value): description
U+E0A0 (0xEE82A0): Version control branch
U+E0A1 (0xEE82A1): LN (line) symbol
U+E0A2 (0xEE82A2): Closed padlock
U+E0B0 (0xEE82B0): Rightwards black arrowhead
U+E0B1 (0xEE82B1): Rightwards arrowhead
U+E0B2 (0xEE82B2): Leftwards black arrowhead
U+E0B3 (0xEE82B3): Leftwards arrowhead

macOS Bash 3.2.57's `set` builtin has garbled output where Powerline's special 
symbols should be in the PS1 variable, but Bash 4.4.19 (installed on macOS via 
homebrew) and Bash 4.3.38 (Ubuntu 16.04) both display it correctly in the 
output of `set`. `echo $PS1` does display the symbols correctly on these 
versions of Bash.

So...

> 3) There's other invocations of "sed" in the file, aren't those affected as 
> well?
The short answer: no. Slightly longer: not by the same thing that's affecting 
the line in the patch, at least.

Long: As described above, the problem isn't actually `sed`: it's the `set` 
builtin in macOS's build of Bash. The other invocations of `sed` should be 
safe, because `sed` properly handles the PUA code points on its own:

$ echo $'\xee\x82\xb0' | sed 's/./@/'
@

The way that `set` is displaying the PS1 variable seems to be sending them to 
`sed` individually or somehow split up:

$ for character in $'\xee' $'\x82' $'\xb0' $'\xee\x82' $'\x82\xb0'; do echo 
$character | sed 's/./@/'; done
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence

Interestingly, Bash 3.2.25's `set` builtin on CentOS 5 correctly displays the 
octal values for the symbols (prompt edited to keep this email ASCII):

$ PS1=$'\xee\x82\xb0 '
>  set | grep PS1
PS1=$'\356\202\260 '

I haven't started digging through Bash's codebase, but it could either be a bug 
that was introduced between 3.2.25 and 3.2.57 or Apple used some silly flags 
when compiling Bash. Ideally, this should be fixed in Bash, but since Apple's 
using such an old version of Bash for license reasons, I think it's unlikely 
that they'll fix the issue.

I think the best way to move forward with this is a new patch that uses `awk` 
instead of `sed`: I tested several `awk` variants and the command was portable 
without requiring any changes to LANG or LC_ALL.

Does that sound like a good plan?


Re: [PATCH] specify encoding for sed command

2018-04-11 Thread Matt Coleman
I found another (possibly better) way to fix this:

> On Apr 10, 2018, at 3:18 AM, Matt Coleman  wrote:
> 
>> 1) What platform OS / version / sed version is this on?
> I'm experiencing this on macOS Sierra (10.12.6). The issue occurs with the 
> OS's native sed, which is FreeBSD sed so the version number is kind of 
> ambiguous.
> 
> The error goes away if I set LANG=C or LC_ALL=C or change it to use GNU sed 
> (installed via homebrew as gsed).

If I change it to use awk instead of sed, it works with mawk, gawk, and macOS 
awk:
unset $(set | awk -F '=' '/^__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*=/ 
{print $1}') 2>/dev/null

I compared sed vs. awk on Linux and Mac and they all take about the same amount 
of time to run (within 0.002ms).

Re: [PATCH] specify encoding for sed command

2018-04-10 Thread Matt Coleman
> 1) What platform OS / version / sed version is this on?
I'm experiencing this on macOS Sierra (10.12.6). The issue occurs with the OS's 
native sed, which is FreeBSD sed so the version number is kind of ambiguous.

The error goes away if I set LANG=C or LC_ALL=C or change it to use GNU sed 
(installed via homebrew as gsed).

> 2) What's the output from "set" that's causing this error? Do we have an
>   isolated test case for that?
On my system, it's caused by whatever powerline is doing to Bash's PS1 variable:
$ set | grep ^PS1= | sed 's/foo/bar/'
sed: RE error: illegal byte sequence

I'm not sure exactly which character is bad here:
$ xxd <<<$PS1
: 5c5b 1b5b 303b 3338 3b32 3b32 3535 3b32  \[.[0;38;2;255;2
0010: 3535 3b32 3535 3b34 383b 323b 303b 3133  55;255;48;2;0;13
0020: 353b 3137 353b 316d 5c5d c2a0 6d61 7474  5;175;1m\]..matt
0030: c2a0 5c5b 1b5b 303b 3338 3b32 3b30 3b31  ..\[.[0;38;2;0;1
0040: 3335 3b31 3735 3b34 383b 323b 3838 3b38  35;175;48;2;88;8
0050: 383b 3838 3b32 326d 5c5d ee82 b0c2 a05c  8;88;22m\].\
0060: 5b1b 5b30 3b33 383b 323b 3230 383b 3230  [.[0;38;2;208;20
0070: 383b 3230 383b 3438 3b32 3b38 383b 3838  8;208;48;2;88;88
0080: 3b38 383b 316d 5c5d 7ec2 a05c 5b1b 5b30  ;88;1m\]~..\[.[0
0090: 3b33 383b 323b 3133 383b 3133 383b 3133  ;38;2;138;138;13
00a0: 383b 3438 3b32 3b38 383b 3838 3b38 383b  8;48;2;88;88;88;
00b0: 3232 6d5c 5dee 82b1 c2a0 5c5b 1b5b 303b  22m\].\[.[0;
00c0: 3338 3b32 3b32 3038 3b32 3038 3b32 3038  38;2;208;208;208
00d0: 3b34 383b 323b 3838 3b38 383b 3838 3b31  ;48;2;88;88;88;1
00e0: 6d5c 5dc2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  m\].
00f0: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  
0100: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  
0110: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  
0120: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  
0130: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a05c  ...\
0140: 5b1b 5b30 3b33 383b 323b 3438 3b34 383b  [.[0;38;2;48;48;
0150: 3438 3b34 383b 323b 3838 3b38 383b 3838  48;48;2;88;88;88
0160: 3b32 326d 5c5d c2a0 ee82 b25c 5b1b 5b30  ;22m\].\[.[0
0170: 3b33 383b 323b 3135 383b 3135 383b 3135  ;38;2;158;158;15
0180: 383b 3438 3b32 3b34 383b 3438 3b34 386d  8;48;2;48;48;48m
0190: 5c5d c2a0 3230 3138 2d30 342d 3130 5c5b  \]..2018-04-10\[
01a0: 1b5b 303b 3338 3b32 3b39 383b 3938 3b39  .[0;38;2;98;98;9
01b0: 383b 3438 3b32 3b34 383b 3438 3b34 383b  8;48;2;48;48;48;
01c0: 3232 6d5c 5dc2 a0ee 82b3 5c5b 1b5b 303b  22m\].\[.[0;
01d0: 3338 3b32 3b32 3038 3b32 3038 3b32 3038  38;2;208;208;208
01e0: 3b34 383b 323b 3438 3b34 383b 3438 3b31  ;48;2;48;48;48;1
01f0: 6d5c 5dc2 a030 333a 3136 c2a0 5c5b 1b5b  m\]..03:16..\[.[
0200: 306d 5c5d 205c 5b1b 5b30 3b33 383b 323b  0m\] \[.[0;38;2;
0210: 3135 383b 3135 383b 3135 383b 3438 3b32  158;158;158;48;2
0220: 3b34 383b 3438 3b34 386d 5c5d c2a0 c2a0  ;48;48;48m\]
0230: 5c5b 1b5b 303b 3338 3b32 3b34 383b 3438  \[.[0;38;2;48;48
0240: 3b34 383b 3439 3b32 326d 5c5d ee82 b0c2  ;48;49;22m\]
0250: a05c 5b1b 5b30 6d5c 5d0a .\[.[0m\].

Re: [PATCH] specify encoding for sed command

2018-04-08 Thread Junio C Hamano
Stephon Harris  writes:

> Fixes issue with seeing `sed: RE error: illegal byte sequence` when running 
> git-completion.bash
> ---
>  contrib/completion/git-completion.bash | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/contrib/completion/git-completion.bash 
> b/contrib/completion/git-completion.bash
> index b09c8a23626b4..52a4ab5e2165a 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -282,7 +282,7 @@ __gitcomp ()
>  
>  # Clear the variables caching builtins' options when (re-)sourcing
>  # the completion script.
> -unset $(set |sed -ne 
> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
> +unset $(set |LANG=C sed -ne 
> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null

Shouldn't LC_ALL and LANG both set and exported to C, as

 (1) ancient systems understand only LANG but not LC_*; and

 (2) modern ones that understand both give precedence to LC_ALL over
 LANG?

If we were to set only one, it is probably more sensible to set
LC_ALL, I suspect, but it may be better to set both, which sends a
sign to the readers that we know what we are doing ;-)



Re: [PATCH] specify encoding for sed command

2018-04-05 Thread SZEDER Gábor
On Thu, Apr 5, 2018 at 8:53 AM, Ævar Arnfjörð Bjarmason
 wrote:
>
> On Thu, Apr 05 2018, Stephon Harris wrote:
>
>> Fixes issue with seeing `sed: RE error: illegal byte sequence` when running 
>> git-completion.bash

Please:
  - prefix the subject line with "completion:".
  - nit: replace "running" with "sourcing".
  - wrap the body of the commit message around 70 characters.
  - sign off your patch; see Documentation/SubmittingPatches and 'git
commit -s'.

>> ---
>>  contrib/completion/git-completion.bash | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/contrib/completion/git-completion.bash 
>> b/contrib/completion/git-completion.bash
>> index b09c8a23626b4..52a4ab5e2165a 100644
>> --- a/contrib/completion/git-completion.bash
>> +++ b/contrib/completion/git-completion.bash
>> @@ -282,7 +282,7 @@ __gitcomp ()
>>
>>  # Clear the variables caching builtins' options when (re-)sourcing
>>  # the completion script.
>> -unset $(set |sed -ne 
>> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
>> +unset $(set |LANG=C sed -ne 
>> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null

I was wondering how could you see an error message from 'sed', when the
stderr of the whole thing is redirected to /dev/null.  It turns out that
such a redirection doesn't affect the stderr of the commands executed
inside the command substitution:

  $ echo $(echo foo ; echo >&2 error) 2>/dev/null
  error
  foo

Interesting, I didn't know that.

> This is getting closer to the issue than your previous patch, but
> there's still some open questions:
>
> 1) What platform OS / version / sed version is this on?
>
> 2) What's the output from "set" that's causing this error? Do we have an
>isolated test case for that?

A quick internet search suggests that this error message tends to appear
on Macs, when its 'sed' encounters an invalid UTF-8 character in its
input.

> 3) There's other invocations of "sed" in the file, aren't those affected
>as well?

The two 'sed' invocations in _git_stash() are suspect, as they process
the output of 'git stash list', which includes the stashes'
descriptions, which can contain whatever the users wished to store there
(though they would probably get a "Warning: commit message did not
conform to UTF-8" message from 9e83266525 (commit-tree: encourage UTF-8
commit messages., 2006-12-22) when doing so).

The 'sed' invocation in __git_complete_revlist_file() is suspect as
well, as it processes the output of 'git ls-tree'.  Pathnames can
contain anything, and while 'git ls-tree' quotes pathnames with
"unusual" characters, I don't know whether it quotes all characters that
can possibly upset Stephon's 'sed'.

What about 'awk' on Stephon's platform?  We already have one 'awk'
invocation in __git_match_ctag(), which we use for 'git grep ' and
'git log -L:'.  That 'awk' script uses a regexp to match the symbol
name in a ctags file; does any programming language allow such unusual
characters in symbol names?

What about 'sed' and 'awk' scripts that don't use regexps at all?  (I
intend to add such an 'awk' script in a day or two...)

> 4) Any reason we wouldn't just set LC_AlL=C for the whole file? I see we
>already do it for our invocation to "git merge".

That would perhaps be the safest thing to do...
But how can we set it for a whole file, when the file is not executed as
a script but sourced into the user's shell environment?


Re: [PATCH] specify encoding for sed command

2018-04-05 Thread Eric Sunshine
On Thu, Apr 5, 2018 at 2:53 AM, Ævar Arnfjörð Bjarmason
 wrote:
> On Thu, Apr 05 2018, Stephon Harris wrote:
>> Fixes issue with seeing `sed: RE error: illegal byte sequence` when running 
>> git-completion.bash
>
> This is getting closer to the issue than your previous patch, but
> there's still some open questions:
>
> 1) What platform OS / version / sed version is this on?
>
> 2) What's the output from "set" that's causing this error? Do we have an
>isolated test case for that?
>
> 3) There's other invocations of "sed" in the file, aren't those affected
>as well?
>
> 4) Any reason we wouldn't just set LC_AlL=C for the whole file? I see we
>already do it for our invocation to "git merge".

Also, missing Signed-off-by: (see Documentation/SubmittingPatches).

Thanks.


Re: [PATCH] specify encoding for sed command

2018-04-05 Thread Ævar Arnfjörð Bjarmason

On Thu, Apr 05 2018, Stephon Harris wrote:

> Fixes issue with seeing `sed: RE error: illegal byte sequence` when running 
> git-completion.bash
> ---
>  contrib/completion/git-completion.bash | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/contrib/completion/git-completion.bash 
> b/contrib/completion/git-completion.bash
> index b09c8a23626b4..52a4ab5e2165a 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -282,7 +282,7 @@ __gitcomp ()
>
>  # Clear the variables caching builtins' options when (re-)sourcing
>  # the completion script.
> -unset $(set |sed -ne 
> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
> +unset $(set |LANG=C sed -ne 
> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null

This is getting closer to the issue than your previous patch, but
there's still some open questions:

1) What platform OS / version / sed version is this on?

2) What's the output from "set" that's causing this error? Do we have an
   isolated test case for that?

3) There's other invocations of "sed" in the file, aren't those affected
   as well?

4) Any reason we wouldn't just set LC_AlL=C for the whole file? I see we
   already do it for our invocation to "git merge".


[PATCH] specify encoding for sed command

2018-04-04 Thread Stephon Harris
Fixes issue with seeing `sed: RE error: illegal byte sequence` when running 
git-completion.bash
---
 contrib/completion/git-completion.bash | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/completion/git-completion.bash 
b/contrib/completion/git-completion.bash
index b09c8a23626b4..52a4ab5e2165a 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -282,7 +282,7 @@ __gitcomp ()
 
 # Clear the variables caching builtins' options when (re-)sourcing
 # the completion script.
-unset $(set |sed -ne 
's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
+unset $(set |LANG=C sed -ne 
's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
 
 # This function is equivalent to
 #

--
https://github.com/git/git/pull/481