At 10:16 pm -0500 17/1/01, Ronald J Kimball wrote:

|   On Thu, Jan 18, 2001 at 12:38:57PM +1030, Paul McCann wrote:
|   > $_ = "Señor";
|   > %table=(150=>"[0x00F1]",
|   >         151=>"[0x00F2]");  #whatever the mappings are
|   > s~([\x80-\xFF])~$table{ord($1)}~ge;
|   > print;
|   >
|   > will get you started. The key is the "e" modifier..
|
|   Although, since scalar hash values interpolate in double-quoted strings
|   anyway, the /e is actually unnecessary with that replacement.
|
|   s~([\x80-\xFF])~$table{ord($1)}~g;
|
|   will work just as well.

Thank you both, Paul and Ronald, equally!  I was moving towards a
solution, and it just needed your messages to clear the logjam.
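
For anyone who wants to check Ronald's point, here is a minimal sketch that runs both forms of the substitution on the same input.  The one-entry %table and the variable names are only for illustration; \x96 is Mac byte 150.

```perl
use strict;
use warnings;

# Illustrative one-entry table: Mac byte 150 (n with tilde) => marker text.
my %table = (150 => "[0x00F1]");

my $input = "Se\x96or";    # \x96 is decimal 150

# Paul's form: /e evaluates the replacement as Perl code.
(my $with_e = $input) =~ s~([\x80-\xFF])~$table{ord($1)}~ge;

# Ronald's form: the replacement is an ordinary double-quoted string,
# and a hash element interpolates there, ord($1) subscript included.
(my $without_e = $input) =~ s~([\x80-\xFF])~$table{ord($1)}~g;

print "$with_e\n$without_e\n";
```

The /e form only becomes necessary when the replacement needs something that does not interpolate in a string, such as a bare function call.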

Here's the end result: a filter that marks every 8-bit Mac character
in a text by appending its {Latin-1} equivalent, where one exists, and
the <Unicode HTML string> for the character.  I post it in full since
it might be useful to others once modified for their particular
requirements.  I shall be making another table to fill any {blanks}
that have windows-1252 equivalents.


%macToLatin1plus=(
128=>"{\xC4}<&#xC4;>",  # LATIN CAPITAL LETTER A WITH DIAERESIS
129=>"{\xC5}<&#xC5;>",  # LATIN CAPITAL LETTER A WITH RING ABOVE
130=>"{\xC7}<&#xC7;>",  # LATIN CAPITAL LETTER C WITH CEDILLA
131=>"{\xC9}<&#xC9;>",  # LATIN CAPITAL LETTER E WITH ACUTE
132=>"{\xD1}<&#xD1;>",  # LATIN CAPITAL LETTER N WITH TILDE
133=>"{\xD6}<&#xD6;>",  # LATIN CAPITAL LETTER O WITH DIAERESIS
134=>"{\xDC}<&#xDC;>",  # LATIN CAPITAL LETTER U WITH DIAERESIS
135=>"{\xE1}<&#xE1;>",  # LATIN SMALL LETTER A WITH ACUTE
136=>"{\xE0}<&#xE0;>",  # LATIN SMALL LETTER A WITH GRAVE
137=>"{\xE2}<&#xE2;>",  # LATIN SMALL LETTER A WITH CIRCUMFLEX
138=>"{\xE4}<&#xE4;>",  # LATIN SMALL LETTER A WITH DIAERESIS
139=>"{\xE3}<&#xE3;>",  # LATIN SMALL LETTER A WITH TILDE
140=>"{\xE5}<&#xE5;>",  # LATIN SMALL LETTER A WITH RING ABOVE
141=>"{\xE7}<&#xE7;>",  # LATIN SMALL LETTER C WITH CEDILLA
142=>"{\xE9}<&#xE9;>",  # LATIN SMALL LETTER E WITH ACUTE
143=>"{\xE8}<&#xE8;>",  # LATIN SMALL LETTER E WITH GRAVE
144=>"{\xEA}<&#xEA;>",  # LATIN SMALL LETTER E WITH CIRCUMFLEX
145=>"{\xEB}<&#xEB;>",  # LATIN SMALL LETTER E WITH DIAERESIS
146=>"{\xED}<&#xED;>",  # LATIN SMALL LETTER I WITH ACUTE
147=>"{\xEC}<&#xEC;>",  # LATIN SMALL LETTER I WITH GRAVE
148=>"{\xEE}<&#xEE;>",  # LATIN SMALL LETTER I WITH CIRCUMFLEX
149=>"{\xEF}<&#xEF;>",  # LATIN SMALL LETTER I WITH DIAERESIS
150=>"{\xF1}<&#xF1;>",  # LATIN SMALL LETTER N WITH TILDE
151=>"{\xF3}<&#xF3;>",  # LATIN SMALL LETTER O WITH ACUTE
152=>"{\xF2}<&#xF2;>",  # LATIN SMALL LETTER O WITH GRAVE
153=>"{\xF4}<&#xF4;>",  # LATIN SMALL LETTER O WITH CIRCUMFLEX
154=>"{\xF6}<&#xF6;>",  # LATIN SMALL LETTER O WITH DIAERESIS
155=>"{\xF5}<&#xF5;>",  # LATIN SMALL LETTER O WITH TILDE
156=>"{\xFA}<&#xFA;>",  # LATIN SMALL LETTER U WITH ACUTE
157=>"{\xF9}<&#xF9;>",  # LATIN SMALL LETTER U WITH GRAVE
158=>"{\xFB}<&#xFB;>",  # LATIN SMALL LETTER U WITH CIRCUMFLEX
159=>"{\xFC}<&#xFC;>",  # LATIN SMALL LETTER U WITH DIAERESIS
160=>"<&#x2020;>",      # DAGGER
161=>"{\xB0}<&#xB0;>",  # DEGREE SIGN
162=>"{\xA2}<&#xA2;>",  # CENT SIGN
163=>"{\xA3}<&#xA3;>",  # POUND SIGN
164=>"{\xA7}<&#xA7;>",  # SECTION SIGN
165=>"<&#x2022;>",      # BULLET
166=>"{\xB6}<&#xB6;>",  # PILCROW SIGN
167=>"{\xDF}<&#xDF;>",  # LATIN SMALL LETTER SHARP S
168=>"{\xAE}<&#xAE;>",  # REGISTERED SIGN
169=>"{\xA9}<&#xA9;>",  # COPYRIGHT SIGN
170=>"<&#x2122;>",      # TRADE MARK SIGN
171=>"{\xB4}<&#xB4;>",  # ACUTE ACCENT
172=>"{\xA8}<&#xA8;>",  # DIAERESIS
173=>"<&#x2260;>",      # NOT EQUAL TO
174=>"{\xC6}<&#xC6;>",  # LATIN CAPITAL LETTER AE
175=>"{\xD8}<&#xD8;>",  # LATIN CAPITAL LETTER O WITH STROKE
176=>"<&#x221E;>",      # INFINITY
177=>"{\xB1}<&#xB1;>",  # PLUS-MINUS SIGN
178=>"<&#x2264;>",      # LESS-THAN OR EQUAL TO
179=>"<&#x2265;>",      # GREATER-THAN OR EQUAL TO
180=>"{\xA5}<&#xA5;>",  # YEN SIGN
181=>"{\xB5}<&#xB5;>",  # MICRO SIGN
182=>"<&#x2202;>",      # PARTIAL DIFFERENTIAL
183=>"<&#x2211;>",      # N-ARY SUMMATION
184=>"<&#x220F;>",      # N-ARY PRODUCT
185=>"<&#x3C0;>",       # GREEK SMALL LETTER PI
186=>"<&#x222B;>",      # INTEGRAL
187=>"{\xAA}<&#xAA;>",  # FEMININE ORDINAL INDICATOR
188=>"{\xBA}<&#xBA;>",  # MASCULINE ORDINAL INDICATOR
189=>"<&#x3A9;>",       # GREEK CAPITAL LETTER OMEGA
190=>"{\xE6}<&#xE6;>",  # LATIN SMALL LETTER AE
191=>"{\xF8}<&#xF8;>",  # LATIN SMALL LETTER O WITH STROKE
192=>"{\xBF}<&#xBF;>",  # INVERTED QUESTION MARK
193=>"{\xA1}<&#xA1;>",  # INVERTED EXCLAMATION MARK
194=>"{\xAC}<&#xAC;>",  # NOT SIGN
195=>"<&#x221A;>",      # SQUARE ROOT
196=>"<&#x192;>",       # LATIN SMALL LETTER F WITH HOOK
197=>"<&#x2248;>",      # ALMOST EQUAL TO
198=>"<&#x2206;>",      # INCREMENT
199=>"{\xAB}<&#xAB;>",  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
200=>"{\xBB}<&#xBB;>",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
201=>"<&#x2026;>",      # HORIZONTAL ELLIPSIS
202=>"{\xA0}<&#xA0;>",  # NO-BREAK SPACE
203=>"{\xC0}<&#xC0;>",  # LATIN CAPITAL LETTER A WITH GRAVE
204=>"{\xC3}<&#xC3;>",  # LATIN CAPITAL LETTER A WITH TILDE
205=>"{\xD5}<&#xD5;>",  # LATIN CAPITAL LETTER O WITH TILDE
206=>"<&#x152;>",       # LATIN CAPITAL LIGATURE OE
207=>"<&#x153;>",       # LATIN SMALL LIGATURE OE
208=>"<&#x2013;>",      # EN DASH
209=>"<&#x2014;>",      # EM DASH
210=>"<&#x201C;>",      # LEFT DOUBLE QUOTATION MARK
211=>"<&#x201D;>",      # RIGHT DOUBLE QUOTATION MARK
212=>"<&#x2018;>",      # LEFT SINGLE QUOTATION MARK
213=>"<&#x2019;>",      # RIGHT SINGLE QUOTATION MARK
214=>"{\xF7}<&#xF7;>",  # DIVISION SIGN
215=>"<&#x25CA;>",      # LOZENGE
216=>"{\xFF}<&#xFF;>",  # LATIN SMALL LETTER Y WITH DIAERESIS
217=>"<&#x178;>",       # LATIN CAPITAL LETTER Y WITH DIAERESIS
218=>"<&#x2044;>",      # FRACTION SLASH
219=>"<&#x20AC;>",      # EURO SIGN
220=>"<&#x2039;>",      # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
221=>"<&#x203A;>",      # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
222=>"<&#xFB01;>",      # LATIN SMALL LIGATURE FI
223=>"<&#xFB02;>",      # LATIN SMALL LIGATURE FL
224=>"<&#x2021;>",      # DOUBLE DAGGER
225=>"{\xB7}<&#xB7;>",  # MIDDLE DOT
226=>"<&#x201A;>",      # SINGLE LOW-9 QUOTATION MARK
227=>"<&#x201E;>",      # DOUBLE LOW-9 QUOTATION MARK
228=>"<&#x2030;>",      # PER MILLE SIGN
229=>"{\xC2}<&#xC2;>",  # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
230=>"{\xCA}<&#xCA;>",  # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
231=>"{\xC1}<&#xC1;>",  # LATIN CAPITAL LETTER A WITH ACUTE
232=>"{\xCB}<&#xCB;>",  # LATIN CAPITAL LETTER E WITH DIAERESIS
233=>"{\xC8}<&#xC8;>",  # LATIN CAPITAL LETTER E WITH GRAVE
234=>"{\xCD}<&#xCD;>",  # LATIN CAPITAL LETTER I WITH ACUTE
235=>"{\xCE}<&#xCE;>",  # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
236=>"{\xCF}<&#xCF;>",  # LATIN CAPITAL LETTER I WITH DIAERESIS
237=>"{\xCC}<&#xCC;>",  # LATIN CAPITAL LETTER I WITH GRAVE
238=>"{\xD3}<&#xD3;>",  # LATIN CAPITAL LETTER O WITH ACUTE
239=>"{\xD4}<&#xD4;>",  # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
240=>"<&#xF8FF;>",      # Apple logo
241=>"{\xD2}<&#xD2;>",  # LATIN CAPITAL LETTER O WITH GRAVE
242=>"{\xDA}<&#xDA;>",  # LATIN CAPITAL LETTER U WITH ACUTE
243=>"{\xDB}<&#xDB;>",  # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
244=>"{\xD9}<&#xD9;>",  # LATIN CAPITAL LETTER U WITH GRAVE
245=>"<&#x131;>",       # LATIN SMALL LETTER DOTLESS I
246=>"<&#x2C6;>",       # MODIFIER LETTER CIRCUMFLEX ACCENT
247=>"<&#x2DC;>",       # SMALL TILDE
248=>"{\xAF}<&#xAF;>",  # MACRON
249=>"<&#x2D8;>",       # BREVE
250=>"<&#x2D9;>",       # DOT ABOVE
251=>"<&#x2DA;>",       # RING ABOVE
252=>"{\xB8}<&#xB8;>",  # CEDILLA
253=>"<&#x2DD;>",       # DOUBLE ACUTE ACCENT
254=>"<&#x2DB;>",       # OGONEK
255=>"<&#x2C7;>",       # CARON
);
####### test string
$_ = '

        ¿Señor?
        über
        fêté
        “Ah!”

';
####### end test
s~([\x80-\xFF])~$1$macToLatin1plus{ord($1)}~g;
print;
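
As a possible shortcut for the windows-1252 table mentioned above, the markers could be generated rather than typed.  This is only a sketch and makes several assumptions: a Perl that ships the Encode module (5.8 or later), Encode's 'MacRoman' and 'cp1252' encoding names, and a hypothetical helper mac_marker() that puts the windows-1252 byte in braces where no Latin-1 equivalent exists.  Its output may differ from the hand-made table wherever Encode's Mac mapping differs.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build the marker string for one Mac byte value (128..255).
sub mac_marker {
    my ($n) = @_;
    my $char = decode('MacRoman', chr($n));    # Mac byte -> Unicode character
    my $cp   = ord($char);
    my $html = sprintf '<&#x%X;>', $cp;

    # A code point up to 0xFF is its own Latin-1 byte.
    return '{' . chr($cp) . '}' . $html if $cp <= 0xFF;

    # Otherwise try windows-1252.  With FB_QUIET, encode() stops at an
    # unmappable character and returns what it managed so far (here: nothing).
    # It also modifies its input, hence the copy.
    my $copy = $char;
    my $win  = encode('cp1252', $copy, Encode::FB_QUIET);
    return (length $win ? '{' . $win . '}' : '') . $html;
}

my %macToLatin1plus = map { $_ => mac_marker($_) } 128 .. 255;

my $text = "\xC0Se\x96or?";    # Mac bytes for the first line of the test string
$text =~ s~([\x80-\xFF])~$1$macToLatin1plus{ord($1)}~g;
print "$text\n";
```

With this approach the dagger (Mac 160), for example, would get the windows-1252 byte \x86 in braces, while NOT EQUAL TO (Mac 173) would still have only its <Unicode> string.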



