On Thu, Aug 11, 2016 at 11:51:06PM +0200, Ingo Schwarze wrote:
> Hi Scott,
> 
> Scott Vanderbilt wrote on Thu, Aug 11, 2016 at 12:58:17PM -0700:
> 
> > I'm trying to use sed to munge some text in HTML files, converting
> > Unicode characters to their HTML entity equivalents, however I can't
> > seem to get it to work.
> > 
> > For instance, this command has no apparent effect:
> > 
> >   sed -i -e 's/\xe2\x80\x94/&mdash;/g' foo.html
> > 
> > Other sed operations using ASCII arguments work fine.
> > 
> > Does sed support Unicode in this fashion?
> 
> Our sed(1) does not have *explicit* UTF-8 support yet.
> That means, /./ will not match a multibyte character, but /../ will
> match a character if its UTF-8 representation is two bytes long,
> or the last byte of one character together with the first of the
> next.  [-] ranges will not work with UTF-8 characters, //i case
> folding will not work, and so on and so forth...
> 
> However, you can still use sed(1) for your job by simply
> treating UTF-8 characters as any ordinary byte string.
> 
> I suspect your problem is that the way you enter the multibyte
> characters is incorrect, and the line shown above doesn't actually
> contain UTF-8, but only ASCII: '\\', 'x', 'e' and so on.
> 
> Let me show you an example that does work:
> 
>    $ hexdump -C input.utf8
>   00000000  3e c3 a4 3c 0a                      |>..<.|
>   00000005
>    $ hexdump -C script.sed           
>   00000000  73 2f c3 a4 2f 61 65 2f  0a         |s/../ae/.|
>   00000009
>    $ sed -f script.sed input.utf8 | hexdump -C
>   00000000  3e 61 65 3c 0a                      |>ae<.|
>   00000005
> 
> Note how the U+00E4 = 0xc3a4 = LATIN SMALL LETTER A WITH DIAERESIS
> gets replaced.
> 
> With that help, you ought to be able to get your task done.
> 
> > The sed(1) man page is silent.
> 
> That's because nothing was done yet to make sed(1) aware of UTF-8.
> 
> > The FAQ section on Character Sets
> > <http://www.openbsd.org/faq/faq10.html#locales> indicates that:
> > 
> >    OpenBSD uses the ASCII character set by default.
> 
> Uh oh.  Ah, hrm.  Well, kind of, but not really.
> 
> The LC_CTYPE locale defaults to "C", but that's required for any
> POSIX-conforming operating system.  By default, ksh(1) emacs editing
> mode partly supports UTF-8, even when LC_CTYPE is C, but ksh(1) vi
> editing mode does not yet (i have a partial patch for that).  By
> default, xterm(1) and pod2man(1) run in UTF-8 mode on OpenBSD, while
> they default to strange hybrids of ASCII and ISO-LATIN-1 elsewhere.
> man(1) always fully supports UTF-8 input, but avoids it for output
> unless you set LC_CTYPE to SOMETHING.UTF-8 or pass it the -Tutf8
> flag.  And so on for many programs...  Even to describe the default
> for one single program, saying nothing but a single word "ASCII" or
> "UTF-8" is usually insufficient, and different programs are very
> different.
> 
> Talking about "the" default makes no sense, really.
> 
> > It also supports the Unicode (UTF-8) character set.
> 
> Ooops!  Do we really say that?  That's a bold claim...  :-o
> 
> In a way, it is true.  You can do many things with UTF-8
> characters, and arguably, that wouldn't be possible if UTF-8
> weren't supported, right?
> 
> Then again, it is not completely true.  There are still many tools
> that do not fully support UTF-8, and some that don't at all.
> 
> > but I'm not sure what bearing that has on this issue.
> 
> You are exactly right!  That statement is so imprecise that it is
> completely unclear what it is: more or less true, a bold lie, or a
> sweeping generalization?
> 
> ...
> 
> Now i'm starting to feel curious.  Let me read on:
> 
>   "The list of supported locales can be obtained by running the
>    command:  locale -a"
> 
> YIKES!!  It looks like i urgently have to fix that part of the FAQ.
> As it stands, it is spreading FAQ:  Fear, Ancertainty, and Quoubt.

In addition to Ingo's advice, you can also use GNU sed (pkg_add gsed) or
perl.
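
If you would rather stay with the base sed, here is a rough sketch of
the byte-string approach Ingo describes. It assumes a POSIX printf(1),
which understands octal escapes, to produce the three UTF-8 bytes of
the em dash (e2 80 94) so they never have to be typed literally. The
helper name to_entity is made up for this example, and I have only
covered the one character from the original question.

```shell
# U+2014 EM DASH is e2 80 94 in UTF-8; POSIX printf(1) takes octal escapes.
emdash=$(printf '\342\200\224')

# to_entity is a hypothetical helper name: it filters stdin and replaces
# each em dash byte sequence with the HTML entity.  LC_ALL=C keeps the
# matching bytewise.  The & is escaped because a bare & in a sed
# replacement stands for the matched text.
to_entity() {
    LC_ALL=C sed "s/$emdash/\\&mdash;/g"
}

# Small self-check on a made-up input line.
printf 'a%sb\n' "$emdash" | to_entity
```

Something like "to_entity < foo.html > foo.new" should then do the job;
check the result with hexdump -C before replacing the original file.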

-- 
Juan Francisco Cantero Hurtado http://juanfra.info
