Re: [h-e-w] Processing chars above \200

John J. Xenakis Sun, 23 Sep 2018 06:11:55 -0700

Hi Eli,

I'm relieved to say that I've found a workaround for the current
situation.  It's another ad-hoc solution, but I was facing a big mess,
and this was an easy solution.


I've now been able to state more specifically what's going on.

The characters in question are ordinary 8-bit extended ASCII
characters, like the European vowels with accents, or the 8-bit
equivalents of the single and double quotes.  These characters come
from ordinary web pages or PDF files or Word documents.  They're
fairly standard in English-language media, and of course also in
foreign-language media.

You suggested that I use the "raw-text" coding system, implying that
these characters are random binary data.  But they're actually
completely valid 8-bit characters that are commonly used in Western
media.

I'm beginning to remember now why ten years ago I set up the default
coding system to be "windows-1252-dos".  This was the coding system
most often used to display web pages in IE and Firefox.  This coding
system is standard because it displays all the characters in web pages
from North American and European web sites correctly.  Since I wanted
exactly the same thing in the editor, I used the same coding system in
emacs.

And this works very well in emacs.  The 8-bit characters are displayed
exactly the way they should be.  Furthermore, saving and reloading the
text file preserves the 8-bit characters, so all is well.
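To make the point about these being valid 8-bit characters concrete, here is a minimal Python sketch (outside Emacs, and only an illustration).  The sample bytes are made up for the example; "cp1252" is Python's name for windows-1252.  It shows the bytes decoding to real characters, surviving a save-and-reload round trip, and failing a strict UTF-8 decode.

```python
# Hypothetical sample: "é", then curly double quotes around "ok",
# then a Windows CRLF line ending.
raw = b"\xe9 \x93ok\x94\r\n"

# Decoding as windows-1252 yields real characters, not escapes.
text = raw.decode("cp1252")
print(text.encode("unicode_escape"))

# Encoding with the same coding system reproduces the original bytes,
# which is the save-and-reload round trip described above.
assert text.encode("cp1252") == raw

# The same bytes are not valid UTF-8, so a strict UTF-8 decode fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```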

The exception is when emacs loads a large Windows text file containing
sufficiently many 8-bit European characters, and emacs goes through
its sampling algorithm and unilaterally declares it to be a Unix file.

This is the nightmare scenario I've been talking about, and it's
typically a disaster.  Emacs displays every 8-bit character as an
octal escape such as \351 instead of the character itself, creating a
huge mess.

Furthermore, ordinary commands stop working.  For example,
forward-paragraph no longer works, because ^M is no longer recognized
as an end-of-line character.
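The effect on paragraph commands can be sketched in Python.  This is not Emacs's detection algorithm, which I haven't examined; the helper count_line_endings and the sample text are hypothetical.  It only shows that when CRLF text is split on LF alone, every line keeps a trailing CR (the ^M), so the "blank" line between paragraphs is no longer empty, which is a likely reason paragraph motion breaks.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LFs not preceded by CR
    cr = data.count(b"\r") - crlf   # CRs not followed by LF
    return {"crlf": crlf, "lf": lf, "cr": cr}

# Hypothetical Windows text: two lines, a blank line, then a new paragraph.
sample = b"first line\r\nsecond line\r\n\r\nnew paragraph\r\n"
counts = count_line_endings(sample)
print(counts)

# Split with Unix conventions versus DOS conventions.
unix_lines = sample.split(b"\n")
dos_lines = sample.split(b"\r\n")

# The paragraph separator is b"\r" in the Unix view but empty in the DOS view.
print(unix_lines[2], dos_lines[2])
```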

So the net result is that emacs loads a Windows text file on a Windows
system, decides that it's really a Unix file (which it isn't), and
then really damages the file in a way that's almost impossible to
recover from.  Eli, this is not something that an editor should be
doing gratuitously.

So anyway, as I said, I found an ad-hoc workaround.  I have this very
large text file that's in this damaged state, and I was dreading
having to go through and fix it character by character, and that's
what motivated my original message.

So the ad-hoc workaround is this:

* Open the file in Notepad.  All the 8-bit characters are displayed
  correctly.
* Select and copy the entire text in Notepad.
* In emacs, open a new text file.
* Paste the text that you copied from Notepad.
* Save the result.

Much to my relief, this cures all the 8-bit problems, and I can go
back to reloading and editing the file in emacs.

I have a few additional notes:

Note 1: You asked me to select the problem characters, and type
"C-x =".  After going through the workaround, I can now look at
"before" and "after" versions of the same text in two different files
and buffers.

So I select the character é (e with an acute accent, as in the first
letter of the French spelling of the word elite).  Here is the
information that "C-x =" provides in each of the two cases, the damaged
and repaired file respectively:

Char: \351 (4194281, #o17777751, #x3fffe9, raw-byte) point=76501 of
343691 (22%) column=51

Char: é (233, #o351, #xe9, file #xE9) point=74734 of 336596 (22%)
column=51
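The two character codes are related by a fixed offset, which a few lines of Python can check.  The constant RAW_BYTE_OFFSET is my own name for the difference between the two reported codes; as far as I know, Emacs represents raw bytes 0x80 through 0xFF as character codes 0x3FFF80 through 0x3FFFFF, which matches the output above.

```python
# Assumption: offset implied by the two "C-x =" reports (0x3fffe9 vs 0xe9).
RAW_BYTE_OFFSET = 0x3FFF00

damaged = 4194281   # code reported for the damaged file
repaired = 233      # code reported for the repaired file

# Same numbers in the hex and octal forms that Emacs reports.
print(hex(damaged), oct(damaged))    # 0x3fffe9 0o17777751
print(hex(repaired), oct(repaired))  # 0xe9 0o351

# Subtracting the offset recovers the original byte value.
recovered = damaged - RAW_BYTE_OFFSET
print(recovered == repaired)

# That byte, decoded as windows-1252, is e with an acute accent (U+00E9).
char = bytes([recovered]).decode("cp1252")
print(ascii(char))
```

So the damaged buffer still holds the same byte, #xE9; it is just being treated as a raw byte rather than decoded as a character.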

Note 2: As an additional experiment, I open the repaired file in
"emacs -Q".  It comes up with a coding system of "raw-text-dos", and it
displays the above character as "\351", but without declaring it to be
a Unix file.

If I use "C-x =" on the same character, I get the following:

Char: \351 (4194281, #o17777751, #x3fffe9, raw-byte) point=74734 of
336596 (22%) column=51


Note 3: You asked what software I'm running:

OS: Windows 7 Professional
Editor: GNU Emacs 25.1.1 (i686-w64-mingw32)
WP: Microsoft Word 2003 and 2013
Browser: Firefox Quantum 62.0 (64-bit)

So I hope that information is helpful.  I'm really relieved that I
found this latest ad-hoc workaround, but if there's any way to provide
an option so that I can completely suppress that Unix identification
algorithm, I would really appreciate it, and I suspect that I'm not
the only one.

Thanks.

John
