Dan Kogai <[EMAIL PROTECTED]> writes:
>> Last time I looked CGI communicated via Sockets, and Sockets are IO.
>
> CGI module itself reads via PerlIO, right. But most users fetch the
> result via param(). The CGI module decodes URL escapes (or MIME-coded
> queries such as file uploads). Now you have to say what character
> set param() returns.
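To make the param() question concrete, here is a minimal sketch of turning raw octets (as a CGI layer might hand them back after URL-unescaping) into Perl's internal form with Encode::decode. The byte values and the choice of "utf-8" are illustrative assumptions, not anything the CGI module does for you:

```perl
use Encode qw(decode);

# Hypothetical raw value after URL-unescaping: "\x{65E5}\x{672C}"
# ("Japan") encoded as UTF-8 octets.
my $octets = "\xE6\x97\xA5\xE6\x9C\xAC";

# decode() answers the "what character set" question by converting
# the octets into Perl's internal character form.
my $string = decode("utf-8", $octets);

printf "%d octets became %d characters\n",
    length($octets), length($string);   # prints "6 octets became 2 characters"
```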
No you don't - by this point we are in the perl world, so we know what
encoding it is - perl's internal one.

> There the real (en|de)coding business is done
> AFTER the PerlIO phase is over.
>
>>> You can't make PerlIO the standard way of setting the UTF-8 flag.
>>
>> I not only _can_, I _have_ (made it the _standard_ way) - ;-)
>
> And force billions of open()s in existing code to be rewritten?

No. We have chosen the defaults so that almost all of them can stay the
way they are (we think).

>
> Not yet. I'll work on it. But *.ucm for CJK is not likely; it
> simply gets TOO BIG.

Which is why ext/Encode/compile exists - to convert .ucm into a
manageable binary form.

> One of the alternatives I am thinking of is the
> approach mentioned in Encode::XS.
>
>>> If I had _ANY_ test data I would run the compiled test and give you
>>> the comparative numbers.
>>
>> You can use t/table.euc under the Jcode module, for instance. table.utf8
>> in my code example is just a utf8 version thereof. That is data which
>> contains all the characters defined in EUC (well, actually JISX0212 is
>> not included, but very few environments can display JISX0212).

Excellent!

>
>>> On what occasion is legacy compatibility required for encode()?
>>
>> Encode is built on perl. It takes perl strings. Perl has legacy reasons
>> to treat strings a certain way. Encode just works with what it gets.
>
> To me perl has no 'string'. It is just a PV that happens to store
> strings.

But it does store strings - i.e. sequences of characters in perl's
internal set. XS code in perl5.6+ has to be careful to remember that
and not just blindly assume that a byte is a byte - but that is another
story.

> In the age of Unicode we have to be careful not only about the
> term 'char' but about 'string' as well....

Yes.
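The point that encode() starts from "Perl's internal form" can be sketched like this; the particular string literal is an illustrative assumption:

```perl
use Encode qw(encode);

# A string in Perl's internal form: four characters, one above \x7F.
my $string = "caf\x{E9}";    # "café"

# encode() converts from the internal form to the named encoding,
# regardless of how perl happened to store the string internally.
my $latin1 = encode("iso-8859-1", $string);  # 4 octets: 63 61 66 E9
my $utf8   = encode("utf-8", $string);       # 5 octets: E9 becomes C3 A9
```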
>
>> As for the need for legacy behaviour in Encode, it makes sense for
>> Europeans and Americans for the things we type to be interpreted as
>> iso8859-1 - if we then ask for those strings to be encoded in iso2022
>> or Big5, then that is a sensible thing for encode() to do - if only to
>> put in the \x1b... escapes...
>
> Right. Very fortunately the Tcl table does preserve ASCII, while the
> original table from the Unicode consortium did not.
>
>> " $bytes = encode(ENCODING, $string[, CHECK])
>>
>> Encodes string from Perl's internal form into I<ENCODING> and returns
>> a sequence of octets. For CHECK see L</"Handling Malformed Data">.
>
> Yes, that "Perl's internal form" was the key. We should be more
> explicit about that.
>
>> "Perl's internal form" means exactly what it says. It _may_ be UTF-8
>> encoded or raw bytes (on mainframes it may be UTF-EBCDIC encoded).
>> encode takes that form in its full glory, SvUTF8 mode bits and all,
>> and converts it to the specified encoding.
>
> The problem is that you have to make sure whether $string is UTF8 or
> ASCII, or you get totally unexpected results like the ones I showed in
> my previous articles.

No. Not "unexpected" - exactly what it is specified to do. All 256
possible byte values have a defined meaning in the non-UTF-8 case
(which is iso8859-1 on ASCII machines and native EBCDIC on EBCDIC
mainframes).

>
>>> Are you going to tell millions of novice CGI users/writers to use eval
>>> for error handling?
>>
>> A. Yes - they really should.
>> B. No - I am expecting [EMAIL PROTECTED] folk to give them an
>>    Encode module which can "Do What I Mean".
>>    That _module_ should do the eval {} if necessary.
>
> IMHO eval is abused as an exception handler. It is, after all, eval,
> not try and catch.

Let us not go there - the debate is in the perl5-porters archives.

>
>> eval {} is quite cheap - certainly a lot cheaper than lots of
>> if ($Encode::error) tests.
>
> Or is it? I pretty much doubt if recompilation

I said eval {}, not eval "" - there is no recompilation at all.
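A sketch of the eval {} style being discussed, using encode()'s CHECK argument (quoted above) to make it croak rather than substitute when a character has no mapping. The sample character and encoding are assumptions for illustration:

```perl
use Encode qw(encode);

# U+263A WHITE SMILING FACE has no mapping in iso-8859-1, so with
# CHECK set to a true value encode() croaks instead of substituting.
my $octets = eval { encode("iso-8859-1", "\x{263A}", 1) };

if ($@) {
    # The caller decides how to handle the failure - no
    # $Encode::error-style flag testing needed.
    warn "encode failed: $@";
}
```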
> is cheaper than
> assigning to an SV.

But eval {} is comparable to setting the SV and then testing it.

>
> Should we implement Encode::Carp like CGI::Carp?

No.

>
> I love eval, and that is one of the big reasons why I use perl. At
> the same time I know the cost thereof. eval is so versatile that it is
> too heavy for most cases. If your statement is true, why don't we write
>
> eval { open FH, "<file"; }; die "Can't open file: $@" if $@;
>
> instead of the ever-popular idiom
>
> open FH, "<file" or die "Can't open file: $!"

That is a 'throw', not a 'catch'. You do it that way precisely so that
an outer eval {} can 'catch' it. If open had been designed to throw in
the first place, one would just write:

    open FH, "<file";   # will automatically die with the right message

for the normal die case. In those few cases where one writes

    if (open(FH, ...)) {
    } else {
    }

one would instead write

    eval { open(FH, ...) };
    unless ($@) {
    } else {
    }

But it is not a fair comparison - files get misnamed far more often
than guess_encoding() should try for a non-existent character encoding.

>
>>> As for errors, we should let the caller decide how to handle them.
>>
>> We can provide both.
>
> We definitely should. But to what extent is a good question.
> Encode::Carp?

Not if we can avoid it.

>
>> So please donate code which, given an octet stream, returns a string
>> suggesting its encoding name...
>
> I definitely will.
>
> Dan
--
Nick Ing-Simmons http://www.ni-s.u-net.com/
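The throw-versus-catch distinction above can be sketched as runnable code; the file name and helper sub are hypothetical:

```perl
# 'throw': the ever-popular idiom dies outward on failure.
sub open_or_die {
    my ($name) = @_;
    open my $fh, "<", $name or die "Can't open $name: $!\n";
    return $fh;
}

# 'catch': an outer eval {} traps the die so the caller can recover.
my $fh = eval { open_or_die("no-such-file-we-hope") };
if ($@) {
    print "caught: $@";
}
```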