On Tue, 20 Nov 2001 16:49:38 +0000 (GMT), in perl.unicode you wrote:

> binmode STDIN;
> while(<>)
> {
>     $u = utf16($_);
>     $u->byteswap2 if defined $swap; # $swap defined based on command line options

This looks strange. The way I read the manpage, byteswap2 is meant to be called as a function, not as a Unicode::String object method. In other words, its first parameter is supposed to be a string, not a Unicode::String object (which is what it gets if you invoke it as a method on an object). Did you mean either

    $u = utf16($_);
    $u->byteswap if defined $swap;

or

    $_ = byteswap2($_) if defined $swap;
    $u = utf16($_);

?

> print $u->utf8;
> # some progress report code (one '.' every 1000 lines)
> }

> Having spotted the first line - could it be that I should avoid
> while(<>) and use read() instead ?

That sounds good -- the U+000A (represented as '0A 00' in little-endian order) got ripped apart by your line-oriented processing. Actually, you can keep using <> as long as you change the value of $/ from its default of "\n" to "\x0a\x00", so that it reads the entire UTF-16 character in one go.

And your file does indeed look as if the first line was (correctly) interpreted as UTF-16LE (probably because of the BOM "FF FE" at the beginning), but everything afterwards as UTF-16BE (the default endianness for Unicode::String). So "... 00 1F 00 17 53 AC 4E 1F 00 ..." was interpreted not (as you wanted) as "[00xx] 001F 5317 4EAC 001F" but rather as "001F 0017 53AC 4E1F [00xx]".

So instead of getting (Big5) "... 北 京 中 國 第 一 歷 史 檔 案 館 ... 1984 ... 微 捲 1 捲 ..." / "... Beijing Zhongguo diyi lishi tang'an guan..." (Beijing China first historical something-or-other?), you get mojibake or character salad, including a hyphen '-' followed by bu 'not', a bit later on a '1', "\x1f", a '(R)' registered trademark sign, a lowercase 'r', and so on ("厬 丟 - 下 圬 笀 ?? 毲 厔 橈 栨 餟 1 \x1f (R) ?? ?? r 挾"). So your byteswapping went wonky, presumably due to loss of synchronisation: once a line is cut after the '0A', the next one starts with the leftover '00' and every byte pair after that is off by one.

So, I suggest setting $/ = "\x0a\x00" before reading, and then explicitly byteswapping each line before converting it with utf16(). That's assuming all your data is in little-endian UTF-16.

Cheers,
Philip
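
P.S. In case it helps, here is a minimal sketch of the "set $/, then byteswap each line" idea. It is not your script: to keep it self-contained it uses only core Perl, so the swap is done with pack/unpack instead of byteswap2(), and the conversion with Encode instead of utf16()->utf8. The helper name utf16le_to_utf8_lines and the in-memory sample data are made up for illustration. It also assumes that the byte pair '0A 00' only ever turns up aligned on a character boundary, which is not strictly guaranteed for arbitrary UTF-16LE data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert a UTF-16LE byte string to UTF-8,
# one line (record) at a time.
sub utf16le_to_utf8_lines {
    my ($bytes) = @_;
    my @out;

    # Read from an in-memory filehandle so the sketch needs no input file.
    open(my $fh, '<', \$bytes) or die "cannot open in-memory handle: $!";
    binmode $fh;

    # U+000A in little-endian order is the byte pair 0A 00, so use that
    # as the record separator instead of the default "\n".
    local $/ = "\x0a\x00";

    while (my $line = <$fh>) {
        # Drop the BOM (FF FE) if it starts the first record.
        $line =~ s/\A\xff\xfe// if $. == 1;

        # Swap each pair of bytes: little-endian -> big-endian.
        my $be = pack('n*', unpack('v*', $line));

        # Decode the big-endian data and re-encode it as UTF-8.
        push @out, encode('UTF-8', decode('UTF-16BE', $be));
    }
    close $fh;
    return @out;
}

# Sample data: BOM, then "a" + newline, then U+5317 U+4EAC + newline.
my $sample = "\xff\xfe" . "a\x00\x0a\x00" . "\x17\x53\xac\x4e\x0a\x00";
print for utf16le_to_utf8_lines($sample);
```

With Unicode::String the two marked steps would presumably collapse into the byteswap2()/utf16() pair from the second variant above; the part that matters is that $/ is changed before the first read, so that no record ever ends in the middle of a 16-bit unit.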