Here is an example, though it is still rough.

    use Encode;

    open( IN_FH, $inputFile ) or die "cannot open $inputFile: $!";
    # ERRORLOG is assumed to be already open for writing.
    while( $line = <IN_FH> ) {
        # LEAVE_SRC keeps $line intact when decode() croaks.
        eval { $line = decode('big5', $line, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
        if ($@) {
            warn "ill-formed line at line $. in $inputFile.\n";
            printf ERRORLOG "File %s (line %d): %s", $inputFile, $., $line;
            # $line (in big-5, here) should be \n-terminated.
            $line = decode('big5', $line );
        }
        # $line is in utf8 for the processing that follows...
    }
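The same strict-then-lenient idea can be tried without any input files. The following is only a sketch: the helper name decode_big5_line and the two sample byte strings are made up for illustration, and it assumes an Encode version in which 'big5' resolves to a Big5 table that maps \xA4\x40 and rejects a lead byte followed by a newline.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns the decoded text and,
# if the strict pass failed, the error message from decode().
sub decode_big5_line {
    my ($octets) = @_;

    # Strict pass: croak on the first byte sequence that does not map.
    # LEAVE_SRC keeps $octets unchanged so it can be decoded again below.
    my $text = eval {
        decode('big5', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, undef) if defined $text;

    my $error = $@;
    # Lenient pass: the default CHECK replaces bad input with U+FFFD.
    return (decode('big5', $octets), $error);
}

# Made-up sample data: a valid double-byte character, then a line whose
# lead byte \xA4 is followed by a newline instead of a trail byte.
my @samples = ("\xA4\x40\n", "abc\xA4\n");

my $line_no = 0;
for my $octets (@samples) {
    $line_no++;
    my ($text, $error) = decode_big5_line($octets);
    if (defined $error) {
        warn "ill-formed data at sample line $line_no: $error";
    }
    printf "line %d decoded to %d character(s)\n", $line_no, length $text;
}
```

Returning the error alongside the text lets the caller decide where to log it, so the file name can be added at the call site.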
P.S. Another problem: how can it be determined whether a given
user-defined character (UDC hereafter) is single-byte or double-byte?
The file big5-eten.ucm does not say how to determine the length in bytes
of an unmapped UDC.

Of course (though I don't know whether it is easy or not), you can define
a *new* encoding as big-5 plus mappings for the UDCs at whatever code
points you choose, by preparing a new .ucm file. This method may avoid
the errors caused by the appearance of UDCs.

SADAHIRO Tomoyuki

On Fri, 21 Mar 2003 10:52:07 -0500
"Mark Lewellen" <[EMAIL PROTECTED]> wrote:

> Hi-
> I'm looking for recommendations on how to warn about and record problems
> with ill-formed data. Specifically, I'm reading in Big5 data from multiple
> files and converting it to Perl's utf8, and some of the Big5 double-byte
> combinations are illegal (they appear to be user-defined special symbols).
> I'd like to be able to write code to handle lines with ill-formed data.
> So, if I start with code like:
>
>     open( IN_FH, '<:encoding(big5)', $inputFile ) or die...
>     while( $line = <IN_FH> ) {
>
> or
>
>     open( IN_FH, $inputFile ) or die...
>     while( $line = decode('big5', <IN_FH> ) ) {
>
> I'd like to add logic such as:
>
>     if( <$line has an error> )
>         record the line number and file name
>         record the error and the entire line
>         map error to user-defined character (dependent on error) and
>         process the modified line
>
> Could I get recommendations on how to do this? Thanks-
>
> Mark
>
> PS The STDERR "does not map to Unicode" warning on my version (5.8.0)
> lists only the input file's line number; is it possible to add the input
> file name as well?