Re: de-utf8-ing a string

2007-10-18 Thread E R
On 10/17/07, Juerd Waalboer [EMAIL PROTECTED] wrote:

> utf8::downgrade();

Thanks!
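
For readers of the archive, here is a minimal sketch of what the suggested call does. The sample string and variable names are made up for illustration, and the arguments passed to utf8::downgrade() are an assumption, since the quoted line shows the call without any.

```perl
use strict;
use warnings;

# Hypothetical sample text: every character is <= 0xFF, so it can be downgraded.
my $str = "caf\xe9";

utf8::upgrade($str);                    # force the internal UTF-8 representation
my $was_utf8 = utf8::is_utf8($str);     # true: the internal flag is now set

# Downgrade in place. The second argument (FAIL_OK) asks for a false
# return value instead of an exception when the string cannot be downgraded.
my $ok      = utf8::downgrade($str, 1);
my $is_utf8 = utf8::is_utf8($str);      # false: back to one byte per character

# A character above 0xFF has no single-byte form, so this downgrade fails.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1);

printf "downgraded: %s, still flagged: %s, wide downgraded: %s\n",
    ($ok ? "yes" : "no"), ($is_utf8 ? "yes" : "no"), ($wide_ok ? "yes" : "no");
```

The string compares equal before and after; only its internal storage changes.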


Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread Martin Hosken
Dear Georg,

> Isn't it about time to find a good name for crippled character sets
> with ordinals below 256 only? Otherwise Unicode characters will
> continue to be considered the special case...

Legacy encodings.

Nicely derogatory and generally accepted.

Yours,
Martin



Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread Juerd Waalboer
Georg Bauhaus skribis 2007-10-18 17:01 (+0200):
> Isn't it about time to find a good name for crippled character sets
> with ordinals below 256 only?

These are single byte encodings. I prefer to add the word legacy
too.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  [EMAIL PROTECTED]  http://juerd.nl/sig
  Convolution: ICT solutions and consultancy [EMAIL PROTECTED]


Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread John Delacour

Juerd Waalboer wrote:

> E R skribis 2007-10-18  9:50 (-0500):
>
>> I'm preparing a presentation about Perl and Unicode support, and I'd
>> like to give a name for characters with ordinals above 255. Is there a
>> good name for that class?
>
> They are characters outside the latin-1 range.


Latin-1 has nothing to do with it.  There are countless legacy character
sets that use the code points from 32 to 255, and besides, what
masquerades as Latin-1 in various environments is rarely strict ISO-8859-1.



>> How about extended characters???
>
> Bad name, because it would suggest an actual barrier, which in unicode
> isn't there.


Bad name also because the legacy character sets are often referred to as
extensions to ASCII up to 255 or below.

Above that they are multi-byte characters, but that doesn't mean they're 
necessarily Unicode, since the CJK legacy character sets are also 
multi-byte.


JD
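
The point about single-byte and multi-byte legacy encodings can be sketched with the core Encode module. The two sample characters are chosen here for illustration only; the encoding names are the ones Encode is assumed to recognise ('iso-8859-1', 'shiftjis', 'UTF-8').

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative characters (not from the thread):
my $e_acute    = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A

# A single-byte legacy encoding: one character becomes one byte.
my $latin1 = encode('iso-8859-1', $e_acute);

# A multi-byte legacy CJK encoding: one character becomes two bytes,
# and the result is not Unicode-encoded at all.
my $sjis = encode('shiftjis', $hiragana_a);

# The same character in a Unicode encoding, for comparison.
my $utf8 = encode('UTF-8', $hiragana_a);

printf "iso-8859-1: %d byte(s), shiftjis: %d byte(s), UTF-8: %d byte(s)\n",
    length($latin1), length($sjis), length($utf8);
```

So "more than one byte per character" does not by itself tell you that the data is Unicode.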






Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread E R
I should have added that in my presentation I am attempting to present
Perl strings from a character set agnostic perspective. So, even
though there is a strong bias for Perl to treat character ordinals >
255 as Unicode code-points, I don't want people to automatically think
Unicode when encountering one of these non-legacy characters.

I'm just wondering if there is an established term. Perhaps
"extended/large character ordinal"? It would help in a sentence such as:
"If your string contains a ___, Perl will assume your string
represents Unicode code-points."

ER


Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread Juerd Waalboer
E R skribis 2007-10-18 16:21 (-0500):
> I should have added that in my presentation I am attempting to present
> Perl strings from a character set agnostic perspective.

That is silly, because Perl itself is not at all character set agnostic.

It has unicode strings and it has binary strings, but those are your
tools.

> So, even though there is a strong bias for Perl to treat character
> ordinals > 255 as Unicode code-points

Er, no, all character ordinals, including 0..255, are Unicode
codepoints.

255 is unicode just like 256. There is no actual barrier in between!!
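
A small sketch of that claim, with illustrative variable names: chr() and ord() treat 255 and 256 the same way, and the internal representation of a string does not change what it compares equal to.

```perl
use strict;
use warnings;

my $c255 = chr(255);            # U+00FF
my $c256 = chr(256);            # U+0100
my $both = $c255 . $c256;       # the two characters sit side by side in one string

my $len    = length($both);             # counted in characters, not bytes
my $first  = ord(substr($both, 0, 1));  # code point of the first character
my $second = ord(substr($both, 1, 1));  # code point of the second character

# The same character stored two different ways internally is still equal.
my $plain    = "\xff";
my $upgraded = "\xff";
utf8::upgrade($upgraded);
my $same = ($plain eq $upgraded) ? 1 : 0;

printf "length=%d first=U+%04X second=U+%04X same=%d\n",
    $len, $first, $second, $same;
```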

> I don't want people to automatically think Unicode when encountering
> one of these non-legacy characters.

If they don't automatically think of Unicode, they won't be using Perl's
functionality in the most efficient and time saving way. I'm hoping this
is not your desired goal.

To be honest, I'm not sure you know enough about Perl's string model to
be giving a presentation about Unicode in Perl. You just learnt very
important aspects, and from the things you write, I'd say you still have
some other important aspects to learn or accept. No offense meant.

> I'm just wondering if there is an established term. Perhaps
> extended/large character ordinal?

The established term for a character ordinal is code point.

> It would help as in the sentence: If your string contains a ___, Perl
> will assume your string represents Unicode code-points.

If you use your string for text operations, Perl will assume your string
is a Unicode string.

Note that there is a bug in uppercasing/lowercasing, and in some
built-in regular expression character classes, that causes Perl to look
at the internal encoding. This is a leak in the unicode abstraction, and
will probably be fixed with Perl 5.12.

It is very simple (and future-proof) to work around this problem by
using the Unicode::Semantics module's up() function, or the built-in
utf8::upgrade().
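
A minimal sketch of the leak and the utf8::upgrade() workaround, using a sample character picked for illustration. It assumes a perl running with its default string semantics: the script deliberately does not enable the unicode_strings feature or a locale, either of which would change the result for the non-upgraded string.

```perl
use strict;
use warnings;
# Note: no "use feature 'unicode_strings'" and no "use locale" here on purpose.

my $native   = "\xe9";          # e-acute, stored internally as a single byte
my $upgraded = "\xe9";          # same character...
utf8::upgrade($upgraded);       # ...but stored internally as UTF-8

# Case mapping: the non-upgraded string is left alone, the upgraded one
# is mapped to U+00C9.
my $uc_native   = uc $native;
my $uc_upgraded = uc $upgraded;

# Built-in character classes show the same split.
my $word_native   = ($native   =~ /\w/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;

printf "uc native: U+%04X, uc upgraded: U+%04X\n",
    ord($uc_native), ord($uc_upgraded);
printf "\\w native: %d, \\w upgraded: %d\n", $word_native, $word_upgraded;
```

The two strings compare equal with eq, yet behave differently in text operations; upgrading before such operations gives consistent results.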