Hi Shigio,

I am still talking about the general issue of Python 3 compatibility in pygments_parser.py, a broader issue that was exposed by the original bug report that started this thread. My previous email only summarized the results of my research, for discussion.

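To make the read_file() suggestion from my previous email (quoted below) easier to discuss, here is a minimal sketch. It is not the current code in pygments_parser.py: the signature, the default arguments and the demo() helper are my own assumptions for illustration, and it targets Python 3 only, since ‘surrogateescape’ does not exist in Python 2.

```python
import os
import tempfile


def read_file(path, encoding="utf-8", errors="surrogateescape"):
    """Read source text as str (hypothetical signature, not the plug-in's).

    Bytes that are not valid UTF-8 are mapped to lone surrogates instead of
    raising UnicodeDecodeError, so they can be restored later if needed.
    """
    with open(path, "r", encoding=encoding, errors=errors) as f:
        return f.read()


def demo():
    """Round-trip a line containing a latin1 byte that is invalid in UTF-8."""
    raw = b"caf\xe9 = 1\n"
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        text = read_file(path)
    finally:
        os.remove(path)
    # Encoding with the same error handler restores the original bytes.
    return text, text.encode("utf-8", "surrogateescape")
```

The point of the error handler is that a lexer receives a proper string even for input in an unexpected encoding, while the original bytes remain recoverable.
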
Best regards,
Marcus

On Mon, Jun 3, 2024 at 8:21 AM Shigio YAMAGUCHI <[email protected]> wrote:
> Hi Marcus,
>
> Are you talking about a bug or a feature addition?
> If it is a bug, could you please explain the specific steps to reproduce it?
> If it is a new feature, could you please explain the specification?
> Thank you in advance.
>
> Regards,
> Shigio
>
> On Thu, May 30, 2024 at 2:33 AM Marcus Harnisch <[email protected]> wrote:
> >
> > Hi Shigio
> >
> > I am thinking about tackling this feature in a reasonably useful and robust way. I am not concerned about Python 2.x, but wouldn't want to break compatibility either. As it stands, ‘latin1’ encoding is used for implementing something like “binary but with newlines”.
> >
> > The current implementation of pygments_parser.py is incomplete wrt I/O encoding and will probably break when challenged with characters outside the ASCII range.
> > Encodings of any form of input that are not ASCII-compatible are probably not going to work at all.
> > Many OS-facing functions, such as ‘os.getenv’, but also the low-level parts of ‘subprocess.Popen()’ use ‘sys.getfilesystemencoding()’ for determining the desired encoding. Most current unixoid OS are configured to UTF-8 based locales, and even Python on Windows defaults to UTF-8 for OS-facing encoding (since 2016, Python 3.6+, PEP 529).
> > Any non-ASCII content of gtags.conf is most likely going to break pygments_parser.py in one way or another. I'd propose to rely on ‘sys.getfilesystemencoding()’ as well for reading.
> > Source code must be presented to Pygment's Lexers as string. Programming languages that allow non-ASCII source code would normally use UTF-8 (e.g. Python), which I'd recommend for ‘read_file()’, possibly with an appropriate error handler. Depending on how a Lexer implements string handling, exotic encodings might even be less broken than before if bytes are preserved via ‘surrogateescape’ or ‘backslashreplace’.
> >
> > IMHO, relying on the respective system default encoding in most places and an explicit UTF-8 in read_file() is going to improve compatibility and by side effect helps with unifying code paths between Python 2 and 3.
> >
> > Best regards,
> > Marcus
> >
> > On Thu, May 16, 2024 at 12:42 AM Marcus Harnisch <[email protected]> wrote:
> >>
> >> Hi Shigio
> >>
> >> Glad to hear that it didn't work :-) Thank you for adding this to the known bugs list.
> >>
> >> Best regards,
> >> Marcus
> >>
> >> On Tue, May 14, 2024 at 8:16 AM Shigio YAMAGUCHI <[email protected]> wrote:
> >>>
> >>> Hi Marcus,
> >>> I confirmed that the problem is reproduced.
> >>> I have made a new entry to the 'Known bugs' list.
> >>> Thank you for the report.
> >>>
> >>> [https://www.gnu.org/software/global/bugs.html]
> >>> o Pygments plug-in parser with python3 does not work, if 'ctagscom' is not set.
> >>>   If it is not set, default path obtained by configure script should be used.
> >>>
> >>>   $ cat > gtags.conf
> >>>   default:\
> >>>       :ctagscom=:\
> >>>       :langmap=C\:.c.h:\
> >>>       :gtags_parser=C\:/usr/local/lib/gtags/pygments-parser.la:
> >>>   $ gtags
> >>>   $ global -x '.*'
> >>>   $ _ # no tags
> >>>
> >>> Regards,
> >>> Shigio
> >>>
> >>> On Mon, May 13, 2024 at 5:04 PM Marcus Harnisch <[email protected]> wrote:
> >>> >
> >>> > Hi Shigio
> >>> >
> >>> > On Sat, May 11, 2024 at 5:35 AM Shigio YAMAGUCHI <[email protected]> wrote:
> >>> >>
> >>> >> $ cat gtags.conf
> >>> >> default:\
> >>> >>     :ctagscom=/opt/local/bin/uctags:\
> >>> >>     :langmap=C\:.c.h:\
> >>> >>     :gtags_parser=C\:/usr/local/lib/gtags/pygments-parser.la:
> >>> >
> >>> >
> >>> > The important difference, which exposes the bug, is your explicit configuration of ctagscom. Leave it undefined and rely on whatever UNIVERSAL_CTAGS has been configured to. Only if ctagscom is empty, you will see a comparison between b'' (empty bytes object) and '' (empty string).
> >>> >
> >>> > Best regards,
> >>> > Marcus
> >>>
> >>>
> >>>
> >>> --
> >>> Shigio YAMAGUCHI <[email protected]>
> >>> PGP fingerprint:
> >>> 26F6 31B4 3D62 4A92 7E6F 1C33 969C 3BE3 89DD A6EB
>
>
> --
> Shigio YAMAGUCHI <[email protected]>
> PGP fingerprint:
> 26F6 31B4 3D62 4A92 7E6F 1C33 969C 3BE3 89DD A6EB
>
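
PS: The b'' versus '' mismatch mentioned in the quoted message from May 13 can be reproduced in isolation. The following is only a sketch of one possible way to normalize such a value before comparing it. The helper name is_unset() is hypothetical and not taken from pygments_parser.py. I use ‘os.fsdecode()’ here because it decodes with the filesystem encoding, in line with the proposal above.

```python
import os

# In Python 3, bytes and str never compare equal, even when both are empty.
MISMATCH = (b"" == "")  # evaluates to False


def is_unset(value):
    """Return True for an empty setting, given either bytes or str.

    Hypothetical helper: bytes are decoded with the filesystem encoding
    first, so that the comparison is always str against str.
    """
    if isinstance(value, bytes):
        value = os.fsdecode(value)
    return value == ""
```

With a check like this, an empty ctagscom would be detected regardless of whether the value arrived as bytes or as a string.
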
