On 20.12.2009 22:07, Igor Stasenko wrote:
> 2009/12/20 Henrik Sperre Johansen:
>> On 20.12.2009 20:04, Igor Stasenko wrote:
>>> Hello,
>>> I finished this stuff, and it's ready for adoption.
>>
>> Nice!
>>
>>> See http://bugs.squeak.org/view.php?id=7428
>>>
>>> Does anyone want to help push it into the trunk update stream (using MC configs)?
>>>
>>> It works fine on a recent trunk image;
>>> on Pharo, however, I had some problems installing the changes because of
>>> some differences.
>>>
>>> Tried on PharoCore-1.1-11106-ALPHA.image
>>>
>>> phase2.1.cs
>>> - do not file in the TextEditor changes, since Pharo core doesn't have it.
>>> - do not file in the last line (reorganizing).
>>>
>>> - tests fail because Pharo's String class does not implement
>>> #squeakToUtf8
>>> nor
>>> #utf8ToSqueak
>>>
>>> Do we have a uniform way to encode ANY String -> ByteString (utf8)
>>> and back? What does the ANSI standard say about it? Maybe I'm using the wrong methods?
>>
>> "3.4.6.4 - It is erroneous if stringBody contains any characters that
>> does not exist in the implementation
>> defined execution character set used in the representation of character
>> objects."
>> So, implementation defined.
>> Every internal String (in Squeak and Pharo) should (afaik) be either
>> latin1 (ByteStrings) or utf32 with the high byte used to
>> differentiate the language of the string.
>>
>> To me, sending squeakToUtf8, then using StandardFileStream instead of
>> FileStream seems safe.
>> As long as the ByteString's bytes are utf8, utf8ToSqueak works. (And in
>> most other cases as well.)
>> In fact, for non-utf8 strings it's safer than UTF8Decoder, which does
>> not perform the validity checks (it only reads the total # of bytes) when
>> encountering bytes > 127.
>> The reason it seems mostly for internal use (to me) is the fact that it
>> silently falls back to assuming the string is already in latin1 (i.e., the
>> "valid" ByteString format), instead of raising an error like the stream
>> decoder does. (Which, by the way, would be much nicer if it was a
>> MalformedUTF8Error or some such...)
>>
>> ws := StandardFileStream newFileNamed: 'test.txt'.
>> "Save as latin1"
>> ws nextPutAll: 'ååå'.
>> ws close.
>>
>> "Read with UTF8Decoder"
>> rs := FileStream oldFileNamed: 'test.txt'.
>> "Print this, gives a ?"
>> rs contents.
>> rs close
>>
>> "Read with Latin1Decoder"
>> rs := StandardFileStream oldFileNamed: 'test.txt'.
>> "Print this, gives ååå. since it's not valid utf8, thus assumes latin1"
>> rs contents utf8ToSqueak.
>> rs close
>>
>>> Still, I think we need this thing standardized and common to all
>>> dialects (not just Pharo/Squeak).
>>
>> There's really only one way to store characters in a ByteArray (i.e.
>> ByteString) and call it utf8 encoded.
>> As far as I can tell, Squeak seems to do the right thing :)
>> I believe Nicolas pushed for an implementation in Pharo some time ago, not
>> sure what happened to that.
>
> I seem to have solved this by using #convertToEncoding: / #convertFromEncoding:.
> The tests work fine after that. However, I haven't yet tried using source
> with characters other than Latin1.

Converting to utf8 from ByteString/WideString should not be a problem, as long as you know the ByteString encoding is latin1 (which it should be if it was created by any normal means).
As long as you are SURE the string you are decoding is utf8 (like when you've encoded them all yourself ;) ), convertFromEncoding: shouldn't be a problem either. (See my previous mail: it's the same as used by FileStream, so it lacks the validity checks.)
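
For illustration, a minimal workspace sketch of that round trip. It assumes a Squeak or Pharo image in which String answers to #convertToEncoding: and #convertFromEncoding:, and in which 'utf-8' is a recognized encoding name; the variable names are only for the example.

```smalltalk
| original encoded decoded |
"Build the string from code points so the snippet does not depend on
 the encoding of whatever it is pasted from. 229 is å in latin1."
original := String new: 3 withAll: (Character value: 229).

"Encode: the result is expected to be a ByteString holding the utf8 bytes."
encoded := original convertToEncoding: 'utf-8'.

"Decode: only reasonable here because we produced the bytes ourselves;
 per the discussion above, no validity checks should be relied on."
decoded := encoded convertFromEncoding: 'utf-8'.

"Print this to compare the sizes and to check the round trip."
Array with: original size with: encoded size with: decoded = original
```

If the bytes come from somewhere you do not control, the same caveat as for FileStream applies: malformed input is not reported by this path.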

Cheers,
Henry
