Here is an example, though it is still rough.

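use Encode;

# ERRORLOG is assumed to have been opened for writing beforehand.
open( IN_FH, '<', $inputFile ) or die "Cannot open $inputFile: $!";
while( $line = <IN_FH> ) {
    # FB_CROAK makes decode() die on an unmappable sequence;
    # LEAVE_SRC keeps the original Big5 octets in $line for the log.
    my $decoded = eval {
        decode('big5', $line, Encode::FB_CROAK() | Encode::LEAVE_SRC() )
    };
    if ( !defined $decoded ) {
        warn "ill-formed line at line $. in $inputFile.\n";
        # $line (in big-5, here) should be \n-terminated.
        printf ERRORLOG "File %s (line %d): %s", $inputFile, $., $line;
        # Without a CHECK argument, bad sequences become U+FFFD.
        $decoded = decode('big5', $line );
    }
    $line = $decoded;
    # $line is in utf8 for the processing that follows...
}
close( IN_FH );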
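
The same pattern can be wrapped in a helper sub so that it can be tried on in-memory data. This is only a minimal sketch: the name decode_big5_line, the file name sample.txt and the in-memory log are made up for illustration, and "\xA4\x20" is simply a byte pair that should have no mapping in big5.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one Big5 line, logging it if it is ill-formed.
sub decode_big5_line {
    my ($octets, $file, $lineno, $log_fh) = @_;

    # Strict pass: die on the first sequence with no Unicode mapping.
    my $text = eval {
        decode('big5', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Record where the problem occurred and the raw (still Big5) line.
    printf {$log_fh} "File %s (line %d): %s", $file, $lineno, $octets;

    # Lenient pass: unmappable sequences are replaced with U+FFFD.
    return decode('big5', $octets);
}

# Try it on two in-memory lines; the log is collected in a string.
my $log = '';
open(my $log_fh, '>', \$log) or die "Cannot open in-memory log: $!";
my $good = decode_big5_line("\xA4\xA4\xA4\xE5\n", 'sample.txt', 1, $log_fh);
my $bad  = decode_big5_line("\xA4\x20\n",         'sample.txt', 2, $log_fh);
close($log_fh);
```

Passing the file name and line number as arguments puts both into the log entry, which the default "does not map to Unicode" warning does not do.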

P.S. Another problem.
How can it be determined whether a given user-defined character
(UDC hereafter) is single-byte or double-byte?

The file big5-eten.ucm does not say how to determine
the length in bytes of an unmapped UDC.

Of course (though I don't know whether it is easy or not),
you can define a *new* encoding as big-5 plus mappings for
UDCs at any code points by preparing a new .ucm file.
This method may avoid the errors caused by the appearance of UDCs.
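
Short of writing a new .ucm file, a different and cruder approach is to walk through the octets in Perl and substitute UDCs by hand. The sketch below is only that: it assumes every UDC has the usual Big5 two-byte shape (lead byte 0x81-0xFE, trail byte 0x40-0x7E or 0xA1-0xFE), which is exactly the point that big5-eten.ucm leaves open. The sub name decode_big5_with_udc, the %udc_map table and its single entry are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical: decode Big5 octets one character at a time, handing any
# unmapped two-byte sequence to a caller-supplied callback.
sub decode_big5_with_udc {
    my ($octets, $on_udc) = @_;
    my $out = '';
    # Assumption: lead byte 0x81-0xFE plus a valid trail byte is one character.
    while ($octets =~ /\G(?:([\x81-\xFE][\x40-\x7E\xA1-\xFE])|([\x00-\xFF]))/g) {
        my ($pair, $byte) = ($1, $2);
        if (defined $pair) {
            # Strict decode of this one character; undef means "unmapped".
            my $text = eval {
                decode('big5', $pair, Encode::FB_CROAK() | Encode::LEAVE_SRC());
            };
            $out .= defined $text ? $text : $on_udc->($pair);
        }
        elsif (ord($byte) < 0x80) {
            $out .= $byte;          # ASCII is the same in Big5 and Unicode
        }
        else {
            $out .= "\x{FFFD}";     # stray high byte with no valid trail byte
        }
    }
    return $out;
}

# Hypothetical table mapping one UDC to a private-use code point.
my %udc_map = ( "\xFA\x40" => "\x{E000}" );
my $result = decode_big5_with_udc(
    "\xA4\xA4\xFA\x40\n",
    sub { exists $udc_map{ $_[0] } ? $udc_map{ $_[0] } : "\x{FFFD}" },
);
```

Decoding character by character is slow, so this is better suited to reprocessing only the lines that failed a strict decode than to whole files.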

SADAHIRO Tomoyuki


On Fri, 21 Mar 2003 10:52:07 -0500
"Mark Lewellen" <[EMAIL PROTECTED]> wrote:

> Hi-
>   I'm looking for recommendations on how to warn about and record problems
> with ill-formed data.  Specifically, I'm reading in Big5 data from multiple
> files and converting it to Perl's utf8, and some of the Big5 double-byte
> combinations are illegal (they appear to be user-defined special symbols).
> I'd like to be able to write code to handle lines with ill-formed data.
> So, if I start with code like:
> 
> open( IN_FH, '<:encoding(big5)', $inputFile ) or die...
> while( $line = <IN_FH> ) {
> 
> or
> 
> open( IN_FH, $inputFile ) or die...
> while( $line = decode('big5', <IN_FH> ) ) {
> 
> I'd like to add logic such as:
> 
> if( <$line has an error> )
>   record the line number and file name
>   record the error and the entire line
>   map error to user-defined character (dependent on error) and process the modified line
> 
> Could I get recommendations on how to do this?  Thanks-
> 
> Mark
> 
> PS  The STDERR "does not map to Unicode" warning on my version (5.8.0)
> lists only the input file's line number; is it possible to add the input
> file name as well?
