[scite] UTF-16, unintended changes at 128kB boundary.

Jim Hill Thu, 25 Jan 2007 23:49:16 -0800

Hi all

I have just joined this list,
and have not searched the archives,
so apologies if this has been discussed.


The problem in the subject line (characters silently changing on save)
happens in v168 and in v172, as far as I have seen.

This does not happen with plain ASCII,
nor with Unicode files saved as UTF-8,
only with files saved as UTF-16.

It looks as though the buffer stores Unicode as UTF-8,
and I guess the error may be in the conversion
from UTF-8 to UTF-16 when saving.

It does not happen in the buffer; you do not see the effect
until you make a change, save the file then reload it.

Here are two examples. In each pair,
the char on the 1st line (Cyrillic)
was changed to the char on the 2nd line.

char    codepoint       utf8 bytes
м       x043C           D0BC
¼       x00BC           C2BC

char    codepoint       utf8 bytes
в       x0432           D0B2
²       x00B2           C2B2
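
Both examples follow the same pattern: the trail byte is kept and the lead byte D0 becomes C2. Put another way, the new codepoint equals the value of the trail byte on its own (BC -> x00BC), as if the second byte had been read as a single Latin-1 character. The Python sketch below only checks that arithmetic. It is an illustration, not SciTE's conversion code, and the helper name `corrupted_form` is made up for this example.

```python
def corrupted_form(ch: str) -> str:
    """Return the character obtained by keeping only the UTF-8 trail byte of ch."""
    raw = ch.encode("utf-8")
    assert len(raw) == 2, "sketch covers 2-byte UTF-8 sequences only"
    trail = raw[1]
    # Reading the trail byte alone as Latin-1 gives codepoint == byte value.
    return bytes([trail]).decode("latin-1")


for good in ("м", "в"):
    bad = corrupted_form(good)
    print(good, good.encode("utf-8").hex().upper(), "->",
          bad, bad.encode("utf-8").hex().upper())
# м D0BC -> ¼ C2BC
# в D0B2 -> ² C2B2
```

This would fit a converter that loses the lead byte at a block edge and then treats the orphaned trail byte as a character in its own right, but that is a guess.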

The character that changed was at offset 131072,
i.e. 128kB into the file.
When I make the file bigger, it happens in 2 places:
at 128k and also at 256k.

This change seems to happen only if the 2 bytes of the character
straddle the 128kB boundary. If the boundary is between characters
the change does not happen.

The 128kB is counted in bytes of the UTF-8 buffer,
not in bytes of the UTF-16 file on disk.
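
To check whether a given file would hit the straddling case, one can look at the byte sitting at each multiple of 128kB in the UTF-8 form: if it is a continuation byte (bit pattern 10xxxxxx), a character started before the boundary and continues past it. A minimal Python sketch, assuming the buffer really is plain UTF-8 and that the block size is exactly 131072; the function name is hypothetical:

```python
BLOCK = 128 * 1024  # 131072 == 0x20000


def straddled_boundaries(buf: bytes, block: int = BLOCK) -> list[int]:
    """Return each multiple of `block` that falls inside a multi-byte UTF-8 character."""
    hits = []
    for offset in range(block, len(buf), block):
        # A continuation byte at the boundary means the character began
        # before the boundary and its remaining bytes lie after it.
        if (buf[offset] & 0xC0) == 0x80:
            hits.append(offset)
    return hits


# 131071 ASCII bytes, then a 2-byte character: its bytes sit at 131071 and 131072.
straddling = ("a" * 131071 + "м" + "tail").encode("utf-8")
# 131072 ASCII bytes first: the boundary falls between characters.
aligned = ("a" * 131072 + "м" + "tail").encode("utf-8")

print(straddled_boundaries(straddling))  # [131072]
print(straddled_boundaries(aligned))     # []
```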

However, in v172 another change occurs at the point
of 128kB of bytes on disk. I did not /notice/ this in v168.
This change does not depend on a character straddling
the 128kB boundary; it always happens.

It may be that these changes only happen if characters at
the boundary are multi-byte, and not if they are single-byte.

I used SciTE to manipulate some text that was to be published
as subtitles on videos. Fortunately a proof-reader noticed the
oddities before too much damage was done.
It happened in 4 different files.

Regards
Jim

PS

How I figure SciTE is using UTF-8 in its buffer:
select a curly quote “ and it says 3 selected,
select a Cyrillic char, e.g. в, and it says 2 selected,
select an ASCII char and it says 1 selected;
the selection size shown matches the number of UTF-8 bytes.
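
The three counts can be checked against UTF-8 encoded lengths. A small Python sketch (the helper name is made up for illustration):

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in ("“", "в", "a"):
    print(repr(ch), utf8_len(ch))
# '“' 3
# 'в' 2
# 'a' 1
```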

How I figured the offset of the bad character:
put the cursor before the bad char, do shift+control+home,
and read the "Selected" number in the status bar.
It showed 131071. ( == x1FFFF; +1 -> x20000 == 128k )

To find the byte offset in the UTF-16 format on disk,
I opened the file in UltraEdit v9, and viewed as Hex.
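
The on-disk offset can also be computed from the UTF-8 offset instead of being read from a hex view. A rough Python sketch, assuming the saved file starts with a 2-byte BOM (an assumption; pass bom=False if it does not) and that the given UTF-8 offset falls on a character boundary. The function name is hypothetical:

```python
def utf16_disk_offset(buf: bytes, utf8_offset: int, bom: bool = True) -> int:
    """Byte offset in the UTF-16 file of the character that starts at utf8_offset in buf."""
    # Decoding raises UnicodeDecodeError if the offset splits a character.
    prefix = buf[:utf8_offset].decode("utf-8")
    # Endianness does not affect the length, so utf-16-le is used to avoid an implicit BOM.
    return (2 if bom else 0) + len(prefix.encode("utf-16-le"))


sample = "abмв".encode("utf-8")
print(utf16_disk_offset(sample, 4))  # 'в' starts at UTF-8 offset 4; on disk: 2 + 3*2 = 8
```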

jh


