Re: [h-e-w] Processing chars above \200

John J. Xenakis Fri, 21 Sep 2018 10:33:02 -0700

Hi Eli,


>   Why do you insist in using octal escapes instead of using the
>   corresponding Latin characters literally?  What characters do you
>   need to replace and why?  I suspect those are raw bytes and not
>   real characters, but I don't want to guess.

>   IOW, please give us some more context of your real-life problem.
>   Then we might be able to help you more efficiently.

OK, I'll tell you what I'm trying to do, since it would be
wonderful to get this issue resolved once and for all, instead
of one ad-hoc solution after another, which I've been doing for ten
years.

But first, let me expand on the examples I gave in the last message.
Here's a definition:


(defun 8bit ()
  "Test 8-bit characters."
  (let* ((pos (point))
         (NL "\n")
         (char1 "\235") (char2 "\220")
         (pat1 "\235")  (pat2 "[\230-\237]"))
    (insert "This is a char: " char1 NL)
    (insert "This is another char: " char2 NL)
    (goto-char pos)
    (query-replace-regexp pat1 "x")  ; replaces
    (goto-char pos)
    (query-replace-regexp pat2 "y")  ; does not work
    ))

Now, open a brand new empty file, and execute this macro.  The first
replace works, but the second replace does not.  I don't know whether
this is what's supposed to happen, but at least it doesn't work as I
would expect.

If you modify the macro so it only does the inserts without the
replaces, then the buffer looks like this:

This is a char: \235
This is another char: \220

If you hexlify the buffer, it looks like this:

00000000: 5468 6973 2069 7320 6120 6368 6172 3a20  This is a char: 
00000010: 9d0a 5468 6973 2069 7320 616e 6f74 6865  ..This is anothe
00000020: 7220 6368 6172 3a20 900a                 r char: ..

where hex 9d is octal 235, and hex 90 is octal 220.
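
The octal/hex correspondence can be checked directly in Emacs Lisp.
This is only a small illustration, using the built-in `format'
function and the #o / #x read syntax:

```elisp
;; Octal 235 and 220 are the same byte values as hex 9d and 90.
(format "%x %x" #o235 #o220)     ; => "9d 90"

;; The threshold in the subject line: 128 decimal = 200 octal = 80 hex.
(format "%d %o %x" 128 128 128)  ; => "128 200 80"
```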

You asked me why I use octal instead of the Latin characters
literally.  I've tried pretty much everything over the years.  In the
case of this particular example, I've used octal because there are no
Latin characters corresponding to \235 and \220 (as far as I know),
and emacs displays them in octal form.  And even if there were, I need
to be able to find and optionally replace a range of characters.

I would also mention that if I wanted to do a similar character
replacement using a Perl script, there's no problem.

--

OK, so here's the overall problem.  In the process of writing books
and articles, I create text files with text from a variety of sources.
The sources can include copy and paste from web sites, doc files, pdf
files, and application windows, and can also include text generated by
my scripts, usually in Perl or Java.

So these text files often contain a variety of characters in other
alphabets.  I think it's really cool that if I paste text from a web
site and it contains Chinese, Japanese or Arabic characters, then
emacs displays them correctly.  When I save the text file and reload
it, these characters are replaced with blanks, and that's fine with
me, since I don't really want to save them.

But it's different for other characters.  For example, if you go to a
Turkish media site like dailysabah.com, then you can copy and paste
English text with Turkish characters.  For example, Erdogan's name
appears with a Turkish g, which is a g with a breve (a small curved
mark) on top (decimal char value 287).  So execute the expression
"(insert 287)" and see for
yourself.  If I simply save the file and reload it, I don't want that
character to be replaced with a blank.  I want it to be replaced with
a "g".

So when I first started using emacs ten years ago and ran into this
problem, the first thing I did was write a macro called "fixg" which
replaces a "Turkish g" with "g".
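
For illustration, a single transformation of this kind could be
sketched as follows.  This is not the actual fixg code (that is
appended at the end of this message); the name `sketch-replace-char'
is made up, and the sketch assumes the replacement should apply to the
whole buffer without asking:

```elisp
;; Hypothetical helper, for illustration only.
(defun sketch-replace-char (code replacement)
  "Replace every character with code CODE by the string REPLACEMENT.
Works on the whole current buffer and returns the number of replacements."
  (save-excursion
    (goto-char (point-min))
    (let ((case-fold-search nil)        ; match the exact character only
          (count 0))
      (while (search-forward (char-to-string code) nil t)
        (replace-match replacement t t) ; fixed case, literal replacement
        (setq count (1+ count)))
      count)))

;; Example: turn the Turkish g (287) into a plain "g".
;; (sketch-replace-char 287 "g")
```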

I should mention that when I open a file, I use the coding system
"windows-1252-dos".  I set that up ten years ago, and am willing to
consider the possibility that this is wrong.

Anyway, so over the years I expanded the fixg macro, one character at
a time, so that now it contains dozens of these character
transformations.  I've appended the current fixg macro to this
message, for your amusement.

So that solves 99% of the problems, but the other 1% is a nightmare
scenario, and this is what motivated my current attempts with chars
like \201 and \225.

Sometimes emacs opens one of these text files, and magically decides
that it's a "(Unix)" file.  This is a nightmare because then I have "^M"
at the end of each line, and I can't get rid of them.  I've written a macro
that replaces all ^M's with "", and that gets rid of them for a while,
but they come back.  I've tried using utility programs to convert files
to windows or unix or mac formats, and back again, but the problem is never
fixed.

I've discovered that emacs magically declares a file to be a "(Unix)" file
if it contains enough characters like \201 and \225.  So when the nightmare
scenario occurs, I have to scan through the file visually and find them.
And sometimes these are big files, even 10-30 megabytes, so that's not
an easy task.

So what I want is a macro that goes through the buffer and finds all
characters greater than or equal to 128 = 200 octal = 80 hex.  I've
tried many times to create such a macro, but I've never succeeded,
and the "8bit" macro that I described above shows why.
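
A minimal sketch of the kind of command described above, assuming it
is enough to skip over plain ASCII with the built-in
`skip-chars-forward' and stop at whatever comes next.  The name
`sketch-next-non-ascii' is made up for illustration.  The sketch
deliberately avoids an octal range in a regexp such as the one in the
8bit example:

```elisp
;; Hypothetical command, for illustration only.
(defun sketch-next-non-ascii ()
  "Move point to the next character with code >= 128 and return its position.
Return nil, leaving point at the end of the buffer, if there is none."
  (interactive)
  ;; Skip plain ASCII (octal 000-177).  Whatever stops the skip, other
  ;; than the end of the buffer, is a character >= 128; that includes
  ;; the ones Emacs displays in octal form such as \235.
  (skip-chars-forward "\000-\177")
  (if (eobp)
      (progn
        (when (called-interactively-p 'interactive)
          (message "No characters >= 128 found"))
        nil)
    (point)))
```

Called repeatedly (moving forward one character after each hit), this
should visit every such character in the buffer, which might be more
practical than scanning a 10-30 megabyte file by eye.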

OK, you may be sorry you asked, but that's what I'm trying to do.
What's the solution?

By the way, the current version of my fixg macro is appended for your
amusement.  This has been enhanced one character at a time for ten
years, so it's a complete mess, but it does work 99% of the time, so
I'm reasonably happy with it, except when it doesn't work.

One more issue: I've tried converting my files to UTF-8, which could
solve the problems once and for all, but I've never been able to get
UTF-8 to work in emacs, though I've tried a variety of ways.  Emacs
says that it's saving it as a UTF-8 file, but it always reloads the
file as an ordinary text file, and the problems that I've described
keep happening.  If you tell me how to get UTF-8 to work in emacs,
that might solve the whole problem.
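
One possible setup, offered only as a sketch: it uses the built-in
functions `prefer-coding-system' and `set-buffer-file-coding-system',
and the command name `sketch-save-as-utf-8' is made up.  It assumes
that nothing else in the init file forces windows-1252 when files are
read; if something does, a UTF-8 file would still be decoded as
windows-1252 on reload, whatever was written to disk:

```elisp
;; In the init file: try UTF-8 first when guessing a file's encoding.
(prefer-coding-system 'utf-8)

;; Hypothetical command, for illustration only.
(defun sketch-save-as-utf-8 ()
  "Save the current buffer as UTF-8 with DOS (CRLF) line endings."
  (interactive)
  (set-buffer-file-coding-system 'utf-8-dos)
  (save-buffer))
```

The same steps are available interactively as C-x RET f utf-8-dos RET
followed by C-x C-s, and C-x RET r utf-8 RET rereads the visited file
as UTF-8.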

Thanks.

John

(setq g-changes
        `(
                "ÃÆÃÂ¤" "ÃÂ¤" ; 303 244
                "ÃÆÃÂ©" "ÃÂ©" ; 303 251
                "ÃÆÃÂ¼" "ÃÂ¼" ; 303 274
                "ÃÂ¢Ã¢âÂ¬Ã¢â¬Å" "--" ; 342 200 223
                "ÃÂ¢Ã¢âÂ¬Ã¢âÂ¢" "'" ; 342 200 231 - [X'e2'][X'80'][X'99']
                "ÃÂ¢Ã¢âÂ¬ÃÅ" "'" ; 342 200 ????
                      ; 303 622 302 303 622 302 242 303 242 342 20254 541 302 
                      ; 254 303 20071 305 20034
                "ÃÂ¢Ã¢âÂ¬ÃÂ¦" "..." ; 342 200 246
                "ÃÂ¢Ã¢âÂ¬Ãâ" ,SQUO ; 342 200 234 - [X'e2'][X'80'][X'9c']
                "ÃÂ¢Ã¢âÂ¬ÃÂ¾" ,SQUO ; 342 200 236 - [X'e2'][X'80'][X'9e']
                "ÃÂ¢Ã¢âÂ¬ÃÂ¢" ,"[*]" ; 342 200 236
                "ÃÆÃ¢â¬Å¡ÃâÃÂ§" ,"[ÃÂ§]" ; 342 200 236
                "ÃâÃÂ§" ,"[ÃÂ§]" ; 342 200 236
                "ÃÂ¢Ã¢â¬Å¡ÃÂ¬" "Ã¢âÂ¬" ; 342 202 254
                "ÃâÃÂ " " " ; \302\240 [general purpose space][X'c2'][X'a0']
                "ÃâÃÂ·" " " ; [general purpose space]
                "Ãâ¦ÃÂ¡" "s" ; 305 241 [As in  KoÃâ¦ÃÂ¡ice (Slovakia) - s 
                        ; with 'v' on top]
                "ÃâÃ¢â¬Â¡" "c" ; 304 207 [c with forward ' on top (Swedish)
                "ÃÂ°" "[o]"
                "Ã¢â¬â¢" "'"  ; 248-philippines-dismantling-rebel-groups.txt
                "Ã¢â¬Ë" "'"  ; 
                "Ã¢â¬Â¢" "[*]"  ; from aei iran newsletter
                "Ã¢â¬Â¦" "..."
                "Ã¢â¬â" "--"
                "Ã¯âÂ§" "[*]"  ; Health+Insurance+Query+Access+Manual+2011.txt

                "â" "'"  "â" "--"   "â" "--"   "â¦" "..."
                8804 "<="  8805 ">="

                65306 ":"  ; 65306 = \177432 = \xff1a = [wide colon]
                9113 ":"  ; 9113 = \21631 = \x2399 [colon]
                9658 ">"  ; 9658 = \22672 = \x25ba [blackened right arrow]
                9660 "V"  ; 9660 = \22674 = \x25bc [blackened down arrow]
                9830 "*"  ; 9830 = \23146 = \x2666 [black diamond]
                10003 "[Check]"  ;  10003 = \23423 = \x2713 [check mark]
                "Ã¯ÆÂ¾" "[X]"
                "Ã¯âÂ·" "[*]"
                "Â" ,SQUO  "Â" ,SQUO "Â" "'" "Â" "'"
                "Â" "--"
                9632 "[*]"        ; 9632 = \22640 = \x25a0 [blackened rectangle]
                9642 "[*]"        ; 9642 = \22652 = \x25aa [blackened rectangle]
                9744 "[]"         ; 9744 = \23020 = \x2610 empty rectangle
                9746 "[X]"        ; 9746 = \23022 = \x2612 rectangle with X
                8734 "[INF]"    ; 8734 = \21036 = \x221e [infinity symbol]
                "\401" "a"   ; a with a horizontal bar above
                7713 "g"     ; g with a horizontal bar above
                "\502" "l"   ; Polish l with slanted line thru middle
                "\504" "n"   ; Polish n with acute accent
                333 "o"    ; Shinzo Abe
                363 "u" 275 "e"  ; Taiwan: u overbar, e overbar
                7717 "h"  ; h with a dot underneath
                7778 "S"  ; S with a dot underneath
                299 "i" 363 "u" ; i overbar and u overbar
                487 "g" 287 "g"   ; 287 and 487 are Turkish g
                350 "S" 351 "s" 353 "s" 304 "I" 305 "i" ; Turkish chars
                268 "C" 269 "c" 263 "c" 272 "D" 382 "z"
                259 "a" 277 "e" 537 "s"
                328 "n"               ; n with v on top
                702 "'" 703 "'" 699 "'"
                7732 "Kh" 7733 "kh"   ; k with a line underneath 180606
                7788 "T" 7789 "t" ; t with dot underneath
                7826 "Z" ; cap Z with dot underneath
                7693 "d" ; d with a dot underneath
                380  "z" ; z with a dot on top
                (342 200 231) "'" (?\342 ?\200 ?\234) "\"" ; these two don't work
                7879 "e" ; Vietnamese: e with hat on top, dot on bottom
                7847 "a" 7841 "a"  ; Vietnamese variations of 'a'

        ))

(defun fixg (&optional nowait)
  "jx-Replace Turkish g with latin g"
  (fix-changes g-changes nowait))


(defun fix-changes (changes &optional nowait)
  "jx-Apply CHANGES, a flat list of OLD NEW pairs, to the current buffer.
OLD is a string, a character code, or a list of these; NEW is a string.
With NOWAIT non-nil, replace everything without asking."
  (interactive)
  (let ((pos (point))
        (xbuf (get-buffer-create "*XXXX*")) ; log buffer
        old new pat)
    (while changes
      (setq old (pop changes))
      (setq new (pop changes))
      (setq pat nil)
      (unless (listp old) (setq old (list old)))
      ;; (print old xbuf)
      ;; Build the search string from the elements of OLD.
      (while old
        (let ((elt (pop old)))
          (if (integerp elt) (setq elt (char-to-string elt)))
          (princ (format "Element: '%s' " elt) xbuf)
          (setq pat (concat pat elt))))
      (setq pat (regexp-quote pat))
      (princ (format "Pattern: '%s'  Replace with: %s\n" pat new) xbuf)
      (goto-char (point-min))
      (if nowait
          ;; replace all without waiting, don't change case
          (jx-replace pat new t)
        ;; else
        (query-replace-regexp pat new nil))) ; repl w/waiting
    (goto-char pos)))

[End of message]
