https://github.com/pharo-project/pharo/pull/2433
> On 29 Jan 2019, at 15:36, Sven Van Caekenberghe <s...@stfx.eu> wrote: > > https://github.com/pharo-project/pharo/issues/2395 > >> On 27 Jan 2019, at 17:03, Sven Van Caekenberghe <s...@stfx.eu> wrote: >> >> Hi Dominique, >> >>> On 27 Jan 2019, at 11:40, Dominique Dartois <d...@dartois.org> wrote: >>> >>> Hello all. >>> If a use french diacritic character in a class name, the code runs but I >>> can’t fileout the package nor save it with Monticello. >>> For example, the C cedilla in the class name drive me to an >>> ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out. >>> >>> Is it a bug or a feature? >>> Thank you >>> >>> <image.png>--- >>> Dominique Dartois >> >> Thanks for reporting this. This is most definitely a bug, I can confirm its >> occurrence. >> >> I'm CC pharo-dev as this is quite important. This will be a long mail. >> >> >> This is one manifestation of a problem that has been present for quite a >> while. >> >> I'll start by describing what I did, what went well and where/how this >> fails, some generic points, and two conceptual solutions (that need further >> verification). >> >> Like you, I created a new subclass: >> >> Object subclass: #ClasseFrançaise >> instanceVariableNames: '' >> classVariableNames: '' >> package: '_UnpackagedPackage' >> >> With comment: >> >> I am ClasseFrançaise. >> >> Try: >> >> ClasseFrançaise new élève. >> ClasseFrançaise new euro. >> >> And two methods (in the 'test' protocol): >> >> élève >> ^ 'élève' >> >> euro >> ^ '€' >> >> I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 >> like ç). >> Like you said, the system can cope with such class and method names and >> seems to function fine. >> >> Looking at the .changes file, the correct source code was appended: >> >> ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: >> 339848! >> >> Object subclass: #ClasseFrançaise >> instanceVariableNames: '' >> classVariableNames: '' >> package: '_UnpackagedPackage'! >> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: >> 0! >> I am ClasseFrançaise.! >> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 >> 12:26'! >> élève >> ^ 'élève'! ! >> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: >> 33898360! >> I am ClasseFrançaise. >> >> Try: >> >> ClasseFrançaise new élève. >> ClasseFrançaise new euro. >> ! >> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 >> 12:27'! >> euro >> ^ '€'! ! >> >> >> Doing a file out (or otherwise saving the source code) fails. The reason is >> an incorrect manipulation of this source file while looking for what is >> called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position >> >> An programmatic way to invoke the same error is by doing >> >> (ClasseFrançaise>>#élève) timeStamp. >> (ClasseFrançaise>>#élève) author. >> >> Both fail with the same error. >> >> >> The source code of methods is (currently) stored in a .sources or .changes >> file. CompiledMethods know their source pointer, an offset in one of these >> files. Right before the place where the source starts is a preamble that >> contains some meta information (including the author and timestamp). To >> access that preamble, the source code pointer is moved backwards to the >> beginning of the preamble (which begins and ends with a !). >> >> >> The current approach fails in the presence of non-ASCII characters. More >> specifically because of a mixup between the concept of byte position and >> character position when using UTF-8, a variable length encoding (both the >> .changes and the .sources are UTF-8 encoded). >> >> For example, consider >> >> 'à partir de 10 €' size. "16" >> 'à partir de 10 €' utf8Encoded size. "19" >> >> So although the string contains 16 characters, it is encoded as 19 bytes, à >> using 2 bytes and € using 3 bytes. In general, moving backwards or forwards >> in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself. >> >> ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving >> backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will >> signal an error when forced to operate in between encoded characters, which >> is what happens here. >> >> It is thus not possible to move to arbitrary bytes positions and assume/hope >> to always arrive on the correct character boundaries and it is also wrong to >> take the difference between two byte positions as the count of characters >> present (since their encoding is of variable length). >> >> SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets >> away with it in 99.99% of all cases since very few people name their classes >> like that). >> >> There are two solutions: operate mostly on the byte level or operate >> correctly on the character level. Here are two conceptual solutions (you >> must execute either solution 1 or 2, not both), with two different inputs. >> >> >> src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe >> 1/27/2019 12:27''! >> euro >> ^ ''€''! !'. >> >> "startPosition := 83" >> >> str := ZnCharacterReadStream on: (src utf8Encoded readStream). >> str position: 83. "at start of euro, the methods source string" >> str upToEnd. >> >> str position: (83 - 3). "before ! before euro" >> >> "find the previous ! before position" >> position := str position. >> binary := str wrappedStream. >> encoder := str encoder. >> >> "solution 1" >> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] >> whileTrue: [ position := position - 1 ]. >> position. >> encoder decodeBytes: (binary next: 80 - position). >> >> "solution 2" >> count := 0. >> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ >> encoder backOnStream: binary; backOnStream: binary. count := count + 1 >> ]. >> str position. >> str next: count. >> >> >> Same code, different input (and starting position): >> >> >> src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: >> ''SvenVanCaekenberghe 1/27/2019 12:27''! >> euro >> ^ ''€''! !'. >> >> "startPosition := 87" >> >> str := ZnCharacterReadStream on: (src utf8Encoded readStream). >> str position: 87. "at start of euro, the methods source string" >> str upToEnd. >> >> str position: (87 - 3). "before ! before euro" >> >> "find the previous ! before position" >> position := str position. >> binary := str wrappedStream. >> encoder := str encoder. >> >> "solution 1" >> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] >> whileTrue: [ position := position - 1 ]. >> position. >> encoder decodeBytes: (binary next: 80 - position). >> >> "solution 2" >> count := 0. >> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ >> encoder backOnStream: binary; backOnStream: binary. count := count + 1 >> ]. >> str position. >> str next: count. >> >> >> Solution 1 (at the byte level) works because $! (ASCII code 33) is also >> encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 >> byte UTF-8 encoded character. The final byte segment must then be given to >> the proper encoder to turn it into characters. >> >> Solution 2 (at the character level) works because it essentially counts >> characters while moving backwards correctly (it moves back 2 steps each time >> because #next moves 1 step forward, while #peek would not help). >> >> Note that in Solution 1 the moving position is held in an external variable, >> while in Solution 2 it is held inside the stream. >> >> >> Implementing either solution requires more internal access of the streams >> (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be >> moved to SourceFile, IMHO. >> >> Obviously this is a sensitive/critical area to touch. >> >> We will also need a regression unit test. >> >> On a higher design level, I think that CompiledMethod>>#timeStamp and >> CompiledMethod>>#author should be cached at the instance level (a unix >> timestamp and a symbol would cost very little). >> >> Sven >> >> >> >> >> >> >