Hi Eli,

> Why do you insist in using octal escapes instead of using the
> corresponding Latin characters literally? What characters do you
> need to replace and why? I suspect those are raw bytes and not
> real characters, but I don't want to guess.
> IOW, please give us some more context of your real-life problem.
> Then we might be able to help you more efficiently.

OK, I'll tell you what I'm trying to do, since it would be wonderful to get this issue resolved once and for all, instead of one ad-hoc solution after another, which I've been doing for ten years.

But first, let me expand on the examples I gave in the last message. Here's a definition:

    (defun 8bit ()
      "Test 8-bit characters"
      (let* ((pos (point))
             (NL "\n")
             (char1 "\235")
             (char2 "\220")
             (pat1 "\235")
             (pat2 "[\230-\237]"))
        (insert "This is a char: " char1 NL)
        (insert "This is another char: " char2 NL)
        (goto-char pos)
        (query-replace-regexp pat1 "x")   ; replaces
        (goto-char pos)
        (query-replace-regexp pat2 "y"))) ; does not work

Now, open a brand new empty file, and execute this macro. The first replace works, but the second replace does not. I don't know whether this is what's supposed to happen, but at least it doesn't work as I would expect.

If you modify the macro so it only does the inserts without the replaces, then the buffer looks like this:

    This is a char: \235
    This is another char: \220

If you hexlify the buffer, it looks like this:

    00000000: 5468 6973 2069 7320 6120 6368 6172 3a20  This is a char:
    00000010: 9d0a 5468 6973 2069 7320 616e 6f74 6865  ..This is anothe
    00000020: 7220 6368 6172 3a20 900a                 r char: ..

where hex 9d is octal 235, and hex 90 is octal 220.

You asked me why I use octal instead of the Latin characters literally. I've tried pretty much everything over the years. In the case of this particular example, I've used octal because there are no Latin characters corresponding to \235 and \220 (as far as I know), and emacs displays them in octal form.
And even if there were, I need to be able to find and optionally replace a range of characters. I would also mention that if I wanted to do a similar character replacement using a Perl script, there's no problem.

--

OK, so here's the overall problem.

In the process of writing books and articles, I create text files with text from a variety of sources. The sources can include copy and paste from web sites, doc files, pdf files, and application windows, and can also include text generated by my scripts, usually in Perl or Java.

So these text files often contain a variety of characters in other alphabets. I think it's really cool that if I paste text from a web site and it contains Chinese, Japanese or Arabic characters, then emacs displays them correctly. When I save the text file and reload it, these characters are replaced with blanks, and that's fine with me, since I don't really want to save them.

But it's different for other characters. For example, if you go to a Turkish media site like dailysabah.com, then you can copy and paste English text with Turkish characters. For example, Erdogan's name appears with a Turkish g, which is a g with a "v" on top (decimal char value 287). So execute the expression "(insert 287)" and see for yourself.

If I simply save the file and reload it, I don't want that character to be replaced with a blank. I want it to be replaced with a "g". So when I first started using emacs ten years ago and ran into this problem, the first thing I did was write a macro called "fixg" which replaces a "Turkish g" with "g".

I should mention that when I open a file, I use the coding system "windows-1252-dos." I set that up ten years ago, and am willing to consider the possibility that this is wrong.

Anyway, so over the years I expanded the fixg macro, one character at a time, so that now it contains dozens of these character transformations. I've appended the current fixg macro to this message, for your amusement.
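
To make the idea behind fixg concrete, here is a minimal sketch of a single transformation with everything else stripped away. The name jx-replace-char-demo is made up for this illustration and is not part of the appended code. It assumes a multibyte buffer and replaces without asking, unlike the query-replace calls in the real macro.

```elisp
(defun jx-replace-char-demo (code replacement)
  "Replace every character with code point CODE by the string REPLACEMENT.
Hypothetical helper for illustration only; returns the number of
replacements made in the current buffer."
  (save-excursion
    (goto-char (point-min))
    ;; Turn the integer into a one-character string and quote it so
    ;; that it is searched for literally, as fix-changes does.
    (let ((pat (regexp-quote (char-to-string code)))
          (count 0))
      (while (re-search-forward pat nil t)
        ;; FIXEDCASE and LITERAL are both t: insert REPLACEMENT as is.
        (replace-match replacement t t)
        (setq count (1+ count)))
      count)))
```

For example, evaluating (jx-replace-char-demo 287 "g") in a buffer that contains the Turkish g should turn each one into a plain "g" and return how many were changed.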

So that solves 99% of the problems, but the other 1% is a nightmare scenario, and this is what motivated my current attempts with chars like \201 and \225.

Sometimes emacs opens one of these text files, and magically decides that it's a "(Unix)" file. This is a nightmare because then I have "^M" at the end of each line, and I can't get rid of them. I've written a macro that replaces all ^M's with "", and that gets rid of them for a while, but they come back. I've tried using utility programs to convert files to windows or unix or mac formats, and back again, but the problem is never fixed.

I've discovered that emacs magically declares a file to be a "(Unix)" file if it contains enough characters like \201 and \225. So when the nightmare scenario occurs, I have to scan through the file visually and find them. And sometimes these are big files, even 10-30 megabytes, so that's not an easy task.

So what I want is a macro that goes through the buffer and finds all characters greater than or equal to 128 = 200 octal = 80 hex. I've tried many times to create such a macro, but I've never succeeded, and the "8bit" macro that I described above shows why.

OK, you may be sorry you asked, but that's what I'm trying to do. What's the solution?

By the way, the current version of my fixg macro is appended for your amusement. This has been enhanced one character at a time for ten years, so it's a complete mess, but it does work 99% of the time, so I'm reasonably happy with it, except when it doesn't work.

One more issue: I've tried converting my files to UTF-8, which could solve the problems once and for all, but I've never been able to get UTF-8 to work in emacs, though I've tried a variety of ways. Emacs says that it's saving it as a UTF-8 file, but it always reloads the file as an ordinary text file, and the problems that I've described keep happening. If you tell me how to get UTF-8 to work in emacs, that might solve the whole problem.

Thanks.

John

    (setq g-changes `(
      "ÃÆÃ¤" "ä" ; 303 244
      "̮̩" "é" ; 303 251
      "ÃÆÃ¼" "ü" ; 303 274
      "âââ‰â¬Å" "--" ; 342 200 223
      "âââ‰â¢" "'" ; 342 200 231 - [X'e2'][X'80'][X'99']
      "âââ¬ÃÅ" "'" ; 342 200 ????
      ; 303 622 302 303 622 302 242 303 242 342 20254 541 302 254 303 20071 305 20034
      "âââ¬Ã¦" "..." ; 342 200 246
      "ââ∠â" ,SQUO ; 342 200 234 - [X'e2'][X'80'][X'9c']
      "âââ¬à ¾" ,SQUO ; 342 200 236 - [X'e2'][X'80'][X'9e']
      "âââ¬Ã¢" ,"[*]" ; 342 200 236
      "̢̮â¬Å¡Ãâç" ,"[ç]" ; 342 200 236
      "Ãâç" ,"[ç]" ; 342 200 236
      "âââ¬Å¡Ã¬" "ââ¬" ; 342 202 254
      "Ãâà" " " ; \302\240 [general purpose space][X'c2'][X'a0']
      "Ãâ÷" " " ; [general purpose space]
      "Ãâ¦Ã¡" "s" ; 305 241 [As in KoÃâ¦Ã¡ice (Slovakia) - s with 'v' on top]
      "Ãâââ¬Â¡" "c" ; 304 207 [c with forward ' on top (Swedish)
      "ð" "[o]"
      "ââ¬â¢" "'" ; 248-philippines-dismantling-rebel-groups.txt
      "ââ¬Ë" "'" ;
      "ââ¬Â¢" "[*]" ; from aei iran newsletter
      "ââ¬Â¦" "..."
      "ââ¬â" "--"
      "ïâ§" "[*]" ; Health+Insurance+Query+Access+Manual+2011.txt
      "â" "'"
      "â" "--"
      "â" "--"
      "â¦" "..."
      8804 "<=" 8805 ">="
      65306 ":" ; 65306 = \177432 = \xff1a = [wide colon]
      9113 ":" ; 9113 = \21631 = \x2399 [colon]
      9658 ">" ; 9658 = \22672 = \x25ba [blackened right arrow]
      9660 "V" ; 9660 = \22674 = \x25bc [blackened down arrow]
      9830 "*" ; 9830 = \23146 = \x2666 [black diamond]
      10003 "[Check]" ; 10003 = \23423 = \x2713 [check mark]
      "ïÆÂ¾" "[X]"
      "ïâ·" "[*]"
      "Â" ,SQUO "Â" ,SQUO
      "Â" "'" "Â" "'"
      "Â" "--"
      9632 "[*]" ; 9632 = \22640 = \x25a0 [blackened rectangle]
      9642 "[*]" ; 9642 = \22652 = \x25aa [blackened rectangle]
      9744 "[]" ; 9744 = \23020 = \x2610 empty rectangle
      9746 "[X]" ; 9746 = \23022 = \x2612 rectangle with X
      8734 "[INF]" ; 8734 = \21036 = \x221e [infinity symbol]
      "\401" "a" ; a with a horizontal bar above
      7713 "g" ; g with a horizontal bar above
      "\502" "l" ; Polish l with slanted line thru middle
      "\504" "n" ; Polish n with acute accent
      333 "o" ; Shinzo Abe
      363 "u" 275 "e" ; Taiwan: u overbar, e overbar
      7717 "h" ; h with a dot underneath
      7778 "S" ; S with a dot underneath
      299 "i" 363 "u" ; i overbar and u overbar
      487 "g" 287 "g" ; 287 and 487 are Turkish g
      350 "S" 351 "s" 353 "s" 304 "I" 305 "i" ; Turkish chars
      268 "C" 269 "c" 263 "c" 272 "D" 382 "z" 259 "a" 277 "e"
      537 "s"
      328 "n" ; n with v on top
      702 "'" 703 "'" 699 "'"
      7732 "Kh" 7733 "kh" ; k with a line underneath 180606
      7788 "T" 7789 "t" ; t with dot underneath
      7826 "Z" ; cap Z with dot underneath
      7693 "d" ; d with a dot underneath
      380 "z" ; z with a dot on top
      (342 200 231) "'" (?\342 ?\200 ?\234) "\"" ; these two don't work
      7879 "e" ; Vietnamese: e with hat on top, dot on bottom
      7847 "a" 7841 "a" ; Vietnamese variations of 'a'
      ))

    (defun fixg (&optional nowait)
      "jx-Replace Turkish g with latin g"
      (fix-changes g-changes nowait))

    (defun fix-changes (changes &optional nowait)
      "jx-Replace Turkish g with latin g"
      (interactive)
      (let ((pos (point))
            old new
            (xbuf (get-buffer-create "*XXXX*"))
            pat)
        (while changes
          (setq old (pop changes))
          (setq new (pop changes))
          (setq pat nil)
          (unless (listp old)
            (setq old (list old)))
          ;; (print old xbuf)
          (while old
            (let ((elt (pop old)))
              (if (integerp elt)
                  (setq elt (char-to-string elt)))
              (princ (format "Element: '%s' " elt) xbuf)
              (setq pat (concat pat elt))))
          (setq pat (regexp-quote pat))
          (princ (format "Pattern: '%s' Replace with: %s\n" pat new) xbuf)
          (goto-char (point-min))
          (if nowait
              (jx-replace pat new t) ; replace all without waiting, don't change case
            ;; else
            (query-replace-regexp pat new nil))) ; repl w/waiting
        (goto-char pos)))

[End of message]