https://github.com/pharo-project/pharo/pull/2433

> On 29 Jan 2019, at 15:36, Sven Van Caekenberghe <s...@stfx.eu> wrote:
> 
> https://github.com/pharo-project/pharo/issues/2395
> 
>> On 27 Jan 2019, at 17:03, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>> 
>> Hi Dominique,
>> 
>>> On 27 Jan 2019, at 11:40, Dominique Dartois <d...@dartois.org> wrote:
>>> 
>>> Hello all. 
>>> If a use french diacritic character in a class name, the code runs but I 
>>> can’t fileout the package nor save it with Monticello. 
>>> For example, the C cedilla in the class name drive me to an 
>>> ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.
>>> 
>>> Is it a bug or a feature?
>>> Thank you
>>> 
>>> <image.png>--- 
>>> Dominique Dartois 
>> 
>> Thanks for reporting this. This is most definitely a bug, I can confirm its 
>> occurrence.
>> 
>> I'm CC pharo-dev as this is quite important. This will be a long mail.
>> 
>> 
>> This is one manifestation of a problem that has been present for quite a 
>> while.
>> 
>> I'll start by describing what I did, what went well and where/how this 
>> fails, some generic points, and two conceptual solutions (that need further 
>> verification).
>> 
>> Like you, I created a new subclass:
>> 
>> Object subclass: #ClasseFrançaise
>>      instanceVariableNames: ''
>>      classVariableNames: ''
>>      package: '_UnpackagedPackage'
>> 
>> With comment:
>> 
>> I am ClasseFrançaise.
>> 
>> Try:
>> 
>>      ClasseFrançaise new élève.
>>      ClasseFrançaise new euro.
>> 
>> And two methods (in the 'test' protocol):
>> 
>> élève
>>      ^ 'élève'
>> 
>> euro
>>      ^ '€'
>> 
>> I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 
>> like ç).
>> Like you said, the system can cope with such class and method names and 
>> seems to function fine.
>> 
>> Looking at the .changes file, the correct source code was appended:
>> 
>> ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 
>> 339848!
>> 
>> Object subclass: #ClasseFrançaise
>>       instanceVariableNames: ''
>>       classVariableNames: ''
>>       package: '_UnpackagedPackage'!
>> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 
>> 0!
>> I am ClasseFrançaise.!
>> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 
>> 12:26'!
>> élève
>>       ^ 'élève'! !
>> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 
>> 33898360!
>> I am ClasseFrançaise.
>> 
>> Try:
>> 
>>       ClasseFrançaise new élève.
>>       ClasseFrançaise new euro.
>> !
>> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 
>> 12:27'!
>> euro
>>      ^ '€'! !
>> 
>> 
>> Doing a file out (or otherwise saving the source code) fails. The reason is 
>> an incorrect manipulation of this source file while looking for what is 
>> called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position
>> 
>> An programmatic way to invoke the same error is by doing
>> 
>> (ClasseFrançaise>>#élève) timeStamp.
>> (ClasseFrançaise>>#élève) author.
>> 
>> Both fail with the same error.
>> 
>> 
>> The source code of methods is (currently) stored in a .sources or .changes 
>> file. CompiledMethods know their source pointer, an offset in one of these 
>> files. Right before the place where the source starts is a preamble that 
>> contains some meta information (including the author and timestamp). To 
>> access that preamble, the source code pointer is moved backwards to the 
>> beginning of the preamble (which begins and ends with a !).
>> 
>> 
>> The current approach fails in the presence of non-ASCII characters. More 
>> specifically because of a mixup between the concept of byte position and 
>> character position when using UTF-8, a variable length encoding (both the 
>> .changes and the .sources are UTF-8 encoded).
>> 
>> For example, consider
>> 
>> 'à partir de 10 €' size. "16"
>> 'à partir de 10 €' utf8Encoded size. "19"
>> 
>> So although the string contains 16 characters, it is encoded as 19 bytes, à 
>> using 2 bytes and € using 3 bytes. In general, moving backwards or forwards 
>> in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.
>> 
>> ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving 
>> backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will 
>> signal an error when forced to operate in between encoded characters, which 
>> is what happens here.
>> 
>> It is thus not possible to move to arbitrary bytes positions and assume/hope 
>> to always arrive on the correct character boundaries and it is also wrong to 
>> take the difference between two byte positions as the count of characters 
>> present (since their encoding is of variable length).
>> 
>> SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets 
>> away with it in 99.99% of all cases since very few people name their classes 
>> like that).
>> 
>> There are two solutions: operate mostly on the byte level or operate 
>> correctly on the character level. Here are two conceptual solutions (you 
>> must execute either solution 1 or 2, not both), with two different inputs.
>> 
>> 
>> src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 
>> 1/27/2019 12:27''!
>> euro
>>      ^ ''€''! !'.
>> 
>> "startPosition := 83"
>> 
>> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
>> str position: 83. "at start of euro, the methods source string"
>> str upToEnd.
>> 
>> str position: (83 - 3). "before ! before euro"
>> 
>> "find the previous ! before position"
>> position := str position.
>> binary := str wrappedStream.
>> encoder := str encoder.
>> 
>> "solution 1"
>> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] 
>> whileTrue: [ position := position - 1 ].
>> position.
>> encoder decodeBytes: (binary next: 80 - position).
>> 
>> "solution 2"
>> count := 0.
>> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
>>      encoder backOnStream: binary; backOnStream: binary. count := count + 1 
>> ].
>> str position.
>> str next: count.
>> 
>> 
>> Same code, different input (and starting position):
>> 
>> 
>> src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: 
>> ''SvenVanCaekenberghe 1/27/2019 12:27''!
>> euro
>>      ^ ''€''! !'.
>> 
>> "startPosition := 87"
>> 
>> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
>> str position: 87. "at start of euro, the methods source string"
>> str upToEnd.
>> 
>> str position: (87 - 3). "before ! before euro"
>> 
>> "find the previous ! before position"
>> position := str position.
>> binary := str wrappedStream.
>> encoder := str encoder.
>> 
>> "solution 1"
>> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] 
>> whileTrue: [ position := position - 1 ].
>> position.
>> encoder decodeBytes: (binary next: 80 - position).
>> 
>> "solution 2"
>> count := 0.
>> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
>>      encoder backOnStream: binary; backOnStream: binary. count := count + 1 
>> ].
>> str position.
>> str next: count.
>> 
>> 
>> Solution 1 (at the byte level) works because $! (ASCII code 33) is also 
>> encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 
>> byte UTF-8 encoded character. The final byte segment must then be given to 
>> the proper encoder to turn it into characters.
>> 
>> Solution 2 (at the character level) works because it essentially counts 
>> characters while moving backwards correctly (it moves back 2 steps each time 
>> because #next moves 1 step forward, while #peek would not help).
>> 
>> Note that in Solution 1 the moving position is held in an external variable, 
>> while in Solution 2 it is held inside the stream.
>> 
>> 
>> Implementing either solution requires more internal access of the streams 
>> (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be 
>> moved to SourceFile, IMHO.
>> 
>> Obviously this is a sensitive/critical area to touch.
>> 
>> We will also need a regression unit test.
>> 
>> On a higher design level, I think that CompiledMethod>>#timeStamp and 
>> CompiledMethod>>#author should be cached at the instance level (a unix 
>> timestamp and a symbol would cost very little).
>> 
>> Sven
>> 
>> 
>> 
>> 
>> 
>> 
> 


Reply via email to