Re: [Github-comments] [geany/geany] Geany encoding determination broken? (#2910)

elextr Thu, 30 Sep 2021 22:31:37 -0700

> By 8-bit ASCII I meant ASCII-compatible 8-bit extension the first time around.


As I explained, there is no such thing.  ASCII only defines what values less 
than 128 mean; it does not define what values with the most significant bit set 
mean.  So no, it's also not a suitable name for "no specified encoding", since 
the top 128 values are undefined.  The ISO-8859 series encodings are examples 
of what the top 128 values can mean, and there are 15 of them (numbered 1 to 
16, part 12 was never published).  Actually "no specified encoding" might be a 
good label for the setting since it alludes to falling back on searching for 
one.
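
To make that concrete, here is a small Python illustration (not anything Geany does, just standard library codecs) of one byte with the MSB set meaning a different character in three ISO-8859 parts and nothing at all in ASCII:

```python
# The same byte with the high bit set means something different in each
# ISO-8859 part, and has no meaning at all in plain ASCII.
raw = b"\xe4"

for codec in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    print(codec, "->", raw.decode(codec))

try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    # ASCII simply has no definition for values >= 128.
    print("ascii ->", err.reason)
```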

> So the encoding detection logic works sporadically even with longer files.

What you are actually seeing is "sporadically a longer file with random errors 
happens to be a valid file in some encoding".  So that encoding is found.  This 
is also why "Geany is perfectly able to handle some hybrid ASCII-binary files": 
the file happened to be valid in some encoding.  The encoding detection is not 
buggy, the files are :grin:

So the solution would be to fix the file, possibly with a script that replaces 
non-ASCII values with a selected ASCII value, or masks off the MSB, or replaces 
the non-ASCII with a valid and very visible UTF-8 character, possibly an Emoji 
like :imp:.  That's something only you as the programmer can decide and do.
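
A rough sketch of what such a script could look like in Python, one function per option.  The function names and the default substitute characters are made up for illustration; pick whatever suits your data:

```python
def replace_with_ascii(data: bytes, substitute: bytes = b"?") -> bytes:
    """Replace every byte >= 0x80 with one chosen ASCII byte."""
    return bytes(b if b < 0x80 else substitute[0] for b in data)


def mask_msb(data: bytes) -> bytes:
    """Clear the most significant bit of every byte."""
    return bytes(b & 0x7F for b in data)


def replace_with_marker(data: bytes, marker: str = "\U0001F47F") -> str:
    """Keep ASCII bytes and swap each non-ASCII byte for a visible marker.

    The result is a str; call .encode("utf-8") before writing it to a file.
    """
    return "".join(chr(b) if b < 0x80 else marker for b in data)


sample = b"a\xe9b"
print(replace_with_ascii(sample))
print(mask_msb(sample))
print(replace_with_marker(sample).encode("utf-8"))
```

Note that masking the MSB turns 0x80 into a NUL byte, which is its own problem, so the first or third option is probably safer.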

>  the file is sometimes opened properly, meaning 7-bit ASCII characters are 
> displayed and other values are displayed with hex symbols. Other times it's 
> an UTF-16 jungle because of a single 8-bit value occurrence or similar.

That's again a manifestation of finding an encoding where the file is valid and 
converting that to UTF-8 in the buffer.
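
As an illustration only (plain Python codecs, not Geany's loader), four bytes that are mostly ASCII but contain one 8-bit value fail as ASCII and as UTF-8, yet are perfectly valid little-endian UTF-16:

```python
data = b"ab\xe9d"  # three ASCII letters and one stray 8-bit value

for codec in ("ascii", "utf-8"):
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        print(codec, "rejects the data")

# Each pair of bytes becomes one UTF-16 code unit, so the ASCII letters
# vanish into unrelated characters.
text = data.decode("utf-16-le")
print([hex(ord(c)) for c in text])
```

Both resulting code units land in the CJK ideograph range, which is the sort of "jungle" you describe.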

The display as hex values is something the font management does, not Geany: 
either it generates a synthetic glyph when no font has one for that value, or a 
font in your stack has that glyph.  Sometimes missing glyphs are shown as 
squares, not hex.  (To be precise it's done by 
[HarfBuzz](https://harfbuzz.github.io/) used by 
[Pango](https://pango.gnome.org/) which is part of [GTK](https://www.gtk.org/) 
which is used by [Scintilla](https://www.scintilla.org/) which is used by 
Geany, so it's well buried behaviour and controlled by "many" things.)

> Why wouldn't user be able to manually set encoding

The current behaviour of searching for a valid encoding has evolved to handle 
a common use-case where files are in mixed encodings (is/was common on Windows 
in non-English speaking locales IIUC, and many Geany contributors are in such 
locales).  It would be possible to add a "use this encoding only" option (but 
somebody has to do it).  If the file was not valid in that encoding, though, 
the result would have to be a refusal to load, since Geany would have no idea 
what UTF-8 the file contents were meant to be converted to.

> Also, I tried to set something other than Without encoding in Preferences > 
> Files > Encodings > Default encoding (existing non-Unicode files) and 
> iso88591.txt still opens as ISO-8859-1. Does this also not add up to you or 
> am I missing something?

Again, if you selected something other than a valid encoding for that file, the 
behaviour is to fall back to searching for a working encoding.  All the 
encoding settings are "try this first, then search" rather than "just try this 
or fail".  That's probably where the "None" for the default encoding comes 
from, meaning "Don't try anything first, just search".
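
In Python terms, "try this first, then search" looks roughly like the sketch below.  This is not Geany's code: `load_text` and `SEARCH_ORDER` are hypothetical names, and the candidate list is a made-up short one, not the list or order Geany actually searches.

```python
from typing import Optional, Tuple

# Hypothetical search list for illustration; not Geany's real list or order.
SEARCH_ORDER = ["utf-8", "iso-8859-1"]


def load_text(data: bytes, preferred: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoding, text) for the first candidate that decodes cleanly.

    preferred=None plays the role of "don't try anything first, just search".
    """
    candidates = ([preferred] if preferred else []) + SEARCH_ORDER
    for codec in candidates:
        try:
            return codec, data.decode(codec)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("no candidate encoding accepts this data")


# The preferred encoding does not fit, so the search finds another one.
print(load_text(b"caf\xe9", preferred="ascii"))
```

So setting a default that does not match the file changes nothing visible: the search still ends up at an encoding in which the file is valid.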

Just to finally re-emphasise, the Geany buffer has to be valid UTF-8 with no 
embedded NULs.  All the editing and other functions assume it, and depend on 
it, so loading invalid UTF-8 sequences "without encoding" is not possible.  The 
input must be valid in UTF-8 or an encoding that can be converted to UTF-8.
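
If you want to check your files against that requirement before opening them, a minimal Python test could look like this (`fits_buffer` is a made-up helper name, and this only mirrors the rule as stated above, not Geany's internals):

```python
def fits_buffer(data: bytes) -> bool:
    """True if data is valid UTF-8 and contains no embedded NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(fits_buffer("caf\u00e9".encode("utf-8")))  # valid UTF-8, no NUL
print(fits_buffer(b"caf\xe9"))                   # lone 0xE9 is not valid UTF-8
print(fits_buffer(b"abc\x00def"))                # embedded NUL
```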

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/issues/2910#issuecomment-931823480
