Re: [PATCH] contrib: add unicode/utf8-dump.py

2021-11-01 Thread Martin Liška

On 11/1/21 16:32, David Malcolm wrote:

> Thanks. Here's an updated version of the script that fixes the above issues.


Thanks. Please install it (there are no strict approval rules for the
contrib folder).

Cheers,
Martin


[PATCH] contrib: add unicode/utf8-dump.py

2021-11-01 Thread David Malcolm via Gcc-patches
On Mon, 2021-11-01 at 15:36 +0100, Martin Liška wrote:
> On 11/1/21 15:14, David Malcolm via Gcc-patches wrote:
> > > This script may be useful when debugging issues relating to
> > > Unicode encoding (e.g. when investigating source files with
> > > bidirectional control characters).
> 
> I like the script except for the following flake8 issues:
> 
> $ flake8 contrib/unicode/utf8-dump.py
> contrib/unicode/utf8-dump.py:35:1: E302 expected 2 blank lines, found 1
> contrib/unicode/utf8-dump.py:43:1: E302 expected 2 blank lines, found 1
> contrib/unicode/utf8-dump.py:53:1: E302 expected 2 blank lines, found 1
> contrib/unicode/utf8-dump.py:64:1: E305 expected 2 blank lines after class or function definition, found 1

Thanks.  Here's an updated version of the script that fixes the
above issues.

contrib/ChangeLog:
* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm 
---
 contrib/unicode/utf8-dump.py | 69 
 1 file changed, 69 insertions(+)
 create mode 100755 contrib/unicode/utf8-dump.py

diff --git a/contrib/unicode/utf8-dump.py b/contrib/unicode/utf8-dump.py
new file mode 100755
index 000..f12ee79f9f2
--- /dev/null
+++ b/contrib/unicode/utf8-dump.py
@@ -0,0 +1,69 @@
+#!/usr/bin/env python3
+#
+# Script to dump a UTF-8 file as a list of numbered lines (mimicking GCC's
+# diagnostic output format), interleaved with lines per character showing
+# the Unicode codepoints, the UTF-8 encoding bytes, the name of the
+# character, and, where printable, the characters themselves.
+# The lines are printed in logical order, which may help the reader to grok
+# the relationship between visual and logical ordering in bi-di files.
+#
+# SPDX-License-Identifier: MIT
+#
+# Copyright (C) 2021 David Malcolm .
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
+# OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
+# OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+import sys
+import unicodedata
+
+
+def get_name(ch):
+    try:
+        return unicodedata.name(ch)
+    except ValueError:
+        if ch == '\n':
+            return 'LINE FEED (LF)'
+        return '(unknown)'
+
+
+def get_printable(ch):
+    cat = unicodedata.category(ch)
+    if cat == 'Cc':
+        return '(control character)'
+    elif cat == 'Cf':
+        return '(format control)'
+    elif cat[0] == 'Z':
+        return '(separator)'
+    return ch
+
+
+def dump_file(f_in):
+    line_num = 1
+    for line in f_in:
+        print('%4i | %s' % (line_num, line.rstrip()))
+        for ch in line:
+            utf8_desc = '%15s' % (' '.join(['0x%02x' % b
+                                            for b in ch.encode('utf-8')]))
+            print('%4s |   U+%04X %s %40s %s'
+                  % ('', ord(ch), utf8_desc, get_name(ch), get_printable(ch)))
+        line_num += 1
+
+
+with open(sys.argv[1], mode='r') as f_in:
+    dump_file(f_in)
-- 
2.26.3



Re: [PATCH] contrib: add unicode/utf8-dump.py

2021-11-01 Thread Martin Liška

On 11/1/21 15:14, David Malcolm via Gcc-patches wrote:

> This script may be useful when debugging issues relating to Unicode encoding
> (e.g. when investigating source files with bidirectional control characters).


I like the script except for the following flake8 issues:

$ flake8 contrib/unicode/utf8-dump.py
contrib/unicode/utf8-dump.py:35:1: E302 expected 2 blank lines, found 1
contrib/unicode/utf8-dump.py:43:1: E302 expected 2 blank lines, found 1
contrib/unicode/utf8-dump.py:53:1: E302 expected 2 blank lines, found 1
contrib/unicode/utf8-dump.py:64:1: E305 expected 2 blank lines after class or function definition, found 1
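
For reference, a minimal sketch of the layout these two checks ask for: E302 wants two blank lines before each top-level "def", and E305 wants two blank lines between the end of the last definition and the module-level code that follows. The function names here are placeholders, not taken from the script.

```python
def first():
    return 1


def second():
    return 2


# Module-level code resumes two blank lines after the last definition.
print(first() + second())
```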

Martin


[PATCH] contrib: add unicode/utf8-dump.py

2021-11-01 Thread David Malcolm via Gcc-patches
This script may be useful when debugging issues relating to Unicode
encoding (e.g. when investigating source files with bidirectional control
characters).
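
As a rough illustration of the bidirectional-control case, the sketch below reports where such characters occur in a string. It is not part of the patch: find_bidi_controls and BIDI_CONTROL_CLASSES are hypothetical names, and the set is assumed to cover only the explicit embedding, override, and isolate controls (so the implicit marks U+200E and U+200F are not reported).

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls
# (U+202A..U+202E and U+2066..U+2069).  Assumed scope for this sketch.
BIDI_CONTROL_CLASSES = {'LRE', 'RLE', 'LRO', 'RLO', 'PDF',
                        'LRI', 'RLI', 'FSI', 'PDI'}


def find_bidi_controls(text):
    """Yield (line number, column, character name) per bidi control."""
    for line_num, line in enumerate(text.split('\n'), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                yield line_num, col, unicodedata.name(ch, '(unknown)')


sample = 'int x;\n/* \u202e hidden */ int y;\n'
for hit in find_bidi_controls(sample):
    print('%i:%i: %s' % hit)
```

Columns are counted in code points, not bytes or display cells.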

The script dumps a UTF-8 file as a list of numbered lines (mimicking GCC's
diagnostic output format), interleaved with lines per character showing
the Unicode codepoints, the UTF-8 encoding bytes, the name of the
character, and, where printable, the characters themselves.
The lines are printed in logical order, which may help the reader to grok
the relationship between visual and logical ordering in bi-di files.

For example:

$ cat test.c
int གྷ;
const char *אבג = "ALEF-BET-GIMEL";

$ ./contrib/unicode/utf8-dump.py test.c
   1 | int གྷ;
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0F43  0xe0 0xbd 0x83                       TIBETAN LETTER GHA གྷ
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
   2 | const char *אבג = "ALEF-BET-GIMEL";
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+006F            0x6f                     LATIN SMALL LETTER O o
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+0068            0x68                     LATIN SMALL LETTER H h
     |   U+0061            0x61                     LATIN SMALL LETTER A a
     |   U+0072            0x72                     LATIN SMALL LETTER R r
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+05D0       0xd7 0x90                       HEBREW LETTER ALEF א
     |   U+05D1       0xd7 0x91                        HEBREW LETTER BET ב
     |   U+05D2       0xd7 0x92                      HEBREW LETTER GIMEL ג
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+0041            0x41                   LATIN CAPITAL LETTER A A
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0046            0x46                   LATIN CAPITAL LETTER F F
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0042            0x42                   LATIN CAPITAL LETTER B B
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0054            0x54                   LATIN CAPITAL LETTER T T
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0047            0x47                   LATIN CAPITAL LETTER G G
     |   U+0049            0x49                   LATIN CAPITAL LETTER I I
     |   U+004D            0x4d                   LATIN CAPITAL LETTER M M
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
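
The fields in each per-character row above can be derived with a few standard-library calls. The following is only a sketch of that derivation, not the script itself: describe is a hypothetical helper, and it passes a default to unicodedata.name() where the script uses try/except to special-case the newline.

```python
import unicodedata


def describe(ch):
    """Return (code point, UTF-8 bytes, name, category) for one character."""
    codepoint = 'U+%04X' % ord(ch)
    utf8_bytes = ' '.join('0x%02x' % b for b in ch.encode('utf-8'))
    # Control characters such as '\n' have no name, hence the default.
    name = unicodedata.name(ch, '(unknown)')
    return codepoint, utf8_bytes, name, unicodedata.category(ch)


for ch in 'i\u05d0 \n':
    print(describe(ch))
```

The general category is what decides whether the last column shows the character itself or a placeholder: "Cc" for control characters, "Cf" for format controls, and any "Z" category for separators.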

Tested with Python 3.8

OK for trunk and to backport?

contrib/ChangeLog:
* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm 
---
 contrib/unicode/utf8-dump.py | 65 
 1 file changed, 65 insertions(+)
 create mode 100755 contrib/unicode/utf8-dump.py

diff --git a/contrib/unicode/utf8-dump.py b/contrib/unicode/utf8-dump.py
new file mode 100755
index 000..21885a85bdc
--- /dev/null
+++ b/contrib/unicode/utf8-dump.py
@@ -0,0 +1,65 @@
+#!/usr/bin/env python3
+#
+# Script to dump a UTF-8 file as a list of numbered lines (mimicking GCC's
+# diagnostic output format), interleaved with lines per character showing
+# the Unicode codepoints, the UTF-8 encoding bytes, the name of the
+# character, and, where printable, the characters themselves.
+# The lines are printed in logical order, which may help the reader to grok
+# the relationship between visual and logical ordering in