Re: [Pharo-project] invalid utf8 input detected

Stéphane Ducasse Sat, 23 May 2009 12:39:31 -0700

Excellent!
Thanks guys.
I'm preparing lectures for Torino and I will experiment with the umejava
mcz fixes.


Stef

On May 23, 2009, at 8:49 PM, Adrian Lienhard wrote:

> Wow, great analysis, Nicolas!
>
> I have been trying to find the cause for several hours now. Your third
> track exactly matches my findings.
>
> For example, in Object>>#doesNotUnderstand: prior to the condensing,
> the source contained a non-ASCII character (UTF-8 encoded as the two
> bytes 194 160). This gets correctly transferred during the condensing
> into the new changes file. When you don't save the image (and hence
> have the standard stream without the UTF-8 encoder), what you see in
> the source is the character Â (this is 194) followed by 160. That is,
> we suddenly have two characters, 194 and 160, where before there was
> just one. If you load a package, MC will compare methods and think
> this is a change. When loading the method from the MC file, the source
> is UTF-8 encoded and decodes to the Unicode character 160. When
> storing this source to the file (still without the encoder), it will
> just directly put the byte 160 there. At this point we have lost the
> leading byte 194. Next time we start or save the image and have the
> right encoder again, it will choke because 160 is an invalid first
> byte in UTF-8.
>
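The byte-level chain described above can be replayed outside the image. What follows is a minimal Python sketch (Python rather than Smalltalk, purely to show the bytes). It assumes the character in question is U+00A0 (code point 160) and uses Python's latin-1 codec as a stand-in for a stream that maps each byte to one character; none of the names below come from Pharo.

```python
# Assumption: the non-ASCII character in the method source is U+00A0 (code point 160).
original = "\u00a0"

# 1) With the UTF-8 encoder in place, the character is stored as two bytes.
utf8_bytes = original.encode("utf-8")
print(list(utf8_bytes))            # [194, 160]

# 2) Read back one byte per character (stand-in for a stream without the
#    UTF-8 converter): one character becomes two, code points 194 and 160.
misread = utf8_bytes.decode("latin-1")
print([ord(c) for c in misread])   # [194, 160]

# 3) Writing code point 160 without the encoder stores a single raw byte.
raw_bytes = original.encode("latin-1")
print(list(raw_bytes))             # [160]

# 4) Once the UTF-8 decoder is back, byte 160 on its own is rejected:
#    it is a continuation byte and cannot start a sequence.
try:
    raw_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                  # False
```

Step 2 only produces odd-looking text, while step 4 is the one that raises, which matches the observation that the error shows up only after the encoder is restored.
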
> I think it's safe to fix the invalid methods by overwriting their
> source, so we don't have to backtrack to version 10297.
>
> Thanks,
> Adrian
>
>
> On May 23, 2009, at 19:57 , Nicolas Cellier wrote:
>
>> I confirm the scenario:
>> 1) update10298 runs condenseChanges, which leaves (SourceFiles at: 2)
>> class = StandardFileStream.
>>  This is the seed of further problems, because further changes will
>> be encoded in Latin-1 (or MacRoman, I don't really want to know).
>> 2) update10302 changes the methods with non-ASCII characters.
>> 3) Stef saves the image after update10304, which does reopen
>> (SourceFiles at: 2) in UTF-8, but that's too late, the worm is in the
>> apple.
>>
>> If you save the image just after the condenseChanges, there is no
>> problem, because (SourceFiles at: 2) is opened in Latin-1 AFTER all
>> the condensed changes have gotten into it, and reopened in UTF-8
>> before any further changes get into it.
>> We must track down undue usage of StandardFileStream, such as in
>> #condenseChanges.
>>
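The difference between "save right after the condense" and "save after more updates" can be illustrated with a simulated changes file. This is a hedged Python sketch, not Pharo code: `write_source`, `readable_as_utf8`, and `method_source` are hypothetical names, and the latin-1 codec is assumed as a stand-in for writes that go through a plain StandardFileStream.

```python
def write_source(log: bytearray, source: str, encoding: str) -> None:
    """Append method source to the simulated changes file in the given encoding."""
    log.extend(source.encode(encoding))


def readable_as_utf8(log: bytearray) -> bool:
    """Return True if the whole simulated changes file decodes as UTF-8."""
    try:
        bytes(log).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


method_source = "foo\u00a0bar"  # hypothetical source with one non-ASCII character

# Case A: condense, apply more updates, then save the image.
log_a = bytearray()
write_source(log_a, method_source, "utf-8")    # condense copies the source in UTF-8
write_source(log_a, method_source, "latin-1")  # later updates use the plain stream
case_a_ok = readable_as_utf8(log_a)            # the save reopens the file as UTF-8

# Case B: condense, save immediately, then apply more updates.
log_b = bytearray()
write_source(log_b, method_source, "utf-8")    # condense copies the source in UTF-8
write_source(log_b, method_source, "utf-8")    # file was reopened as UTF-8 first
case_b_ok = readable_as_utf8(log_b)

print(case_a_ok, case_b_ok)  # False True
```

Only the order of the save relative to the later writes differs between the two cases, which is why the problem does not appear for someone who always saves right after updating.
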
>> 2009/5/23 Nicolas Cellier:
>>> What happened exactly is very hard to trace because these FileStreams
>>> are a can of worms...
>>> Here are some of my peregrinations:
>>>
>>> FIRST POSSIBLE TRACK:
>>>
>>> All methods were changed in 10305.
>>> Monticello snapshot/source.st is not UTF-8.
>>> If the file is opened as UTF-8, then we get decompiled code, I don't
>>> know why yet...
>>> But the changes still go into the change log in correct UTF-8 form,
>>> so that's just another bug, but not the real source of the problem.
>>> To get some worms out of the can, just browse the inst var defs of
>>> converter in MultiByteFileStream:
>>> The accessor #converter initializes converter with TextConverter
>>> defaultSystemConverter, which depends on LanguageEnvironment.
>>> That is a Latin1TextConverter in my Latin image.
>>> Unless #reset is called first, in which case it will be initialized
>>> with a UTF8TextConverter.
>>> Yes, but open: fileName forWrite: writeMode does the job too, with a
>>> UTF8TextConverter.
>>> Still following? Me neither.
>>> A better-behaved one is #setConverterForCode, which should let
>>> non-UTF-8 .mcz files work in a UTF-8 environment, but I'm not sure it
>>> is called where required...
>>> I think Yoshiki's changes are necessary only for writing source code
>>> with character codes > 255.
>>> This was not the case for the incriminated methods.
>>>
>>> SECOND POSSIBLE TRACK:
>>>
>>> Everything going to the change log passes through the
>>> MultiByteFileStream, so how did non-UTF-8 characters get in?
>>> I tried to follow two other clues:
>>> 1) There are senders of #primWrite:from:startingAt:count: that are
>>> not redefined in MultiByteFileStream...
>>> For example, using #next:putAll:startingAt: will bypass the
>>> converter.
>>> 2) Using nextPutAll: with a ByteArray argument also bypasses the
>>> converter (see MultiByteFileStream>>#nextPutAll:).
>>> I did not find the senders (do you really believe senders of
>>> nextPutAll: can be analyzed?).
>>> I tried to instrument the code with Notification, but I'm unable to
>>> reproduce the problem, so that was in vain...
>>>
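The general hazard behind both clues, raw bytes reaching the file without passing through the converter, can be shown with Python's standard io module. This is only an analogy under stated assumptions: `io.TextIOWrapper` plays the role of the converting stream and `io.BytesIO` the role of the file; it says nothing about which Pharo senders actually do this.

```python
import io

raw = io.BytesIO()                              # stands in for the file on disk
text = io.TextIOWrapper(raw, encoding="utf-8")  # stands in for the converting stream

# Normal path: the text goes through the encoder, so é becomes 0xC3 0xA9.
text.write("caf\u00e9")
text.flush()

# Bypass: bytes that are already encoded (here in Latin-1) are handed
# straight to the underlying stream, so é lands as the single byte 0xE9.
raw.write("caf\u00e9".encode("latin-1"))

data = raw.getvalue()
print(data)  # b'caf\xc3\xa9caf\xe9'

# Reading the mixed content back as UTF-8 fails on the bypassed part.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

A single bypassing writer is enough to make the whole file unreadable as UTF-8, even if every other write goes through the converter.
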
>>> THIRD POSSIBLE TRACK:
>>>
>>> http://gforge.inria.fr/frs/download.php/22283/Pharo0.1Core-10304cl.zip
>>> has the invalid UTF-8 problem, just before 10305 changes that
>>> introduced decompiled code...
>>> So we might attack the problem with another code snippet:
>>>
>>> (SystemNavigation default
>>>     browseAllCallsOn: (Smalltalk associationAt: #SourceFiles))...
>>>
>>> Hmm, I might have a better clue now.
>>> The problem might come from the condenseChanges in update10298.
>>> What happens in a condenseChanges?
>>> Changes are copied to this file:
>>>
>>> f := FileStream fileNamed: 'ST80.temp'.
>>>
>>> So far, so good, because the concreteStream is a MultiByteFileStream.
>>>
>>> But it finishes with:
>>>
>>>      SourceFiles
>>>              at: 2
>>>              put: (StandardFileStream oldFileNamed: oldChanges name)
>>>
>>> Whoa, no MultiByteFileStream here, so no more UTF-8.
>>> But hey, that would be the inverse problem: reading UTF-8 text with
>>> a Latin-1 reader. I can't get an error doing this, only some strange
>>> sequences of characters (the UTF-8 encoding)...
>>> Unless the incriminated methods are further changed in #script376 or
>>> any other method... In which case they are written in Latin-1 in the
>>> change log...
>>> Hmm... That could possibly be the case. We must restart the update
>>> process from
>>> http://gforge.inria.fr/frs/download.php/22167/Pharo0.1Core-10296cl-2.zip
>>>
>>> One thing is sure: at the next returnFromSnapshot, FileDirectory
>>> class>>startup will reopen the changes in UTF-8.
>>> So saving the image will reopen them in UTF-8...
>>>
>>> But wait... Maybe we have enough pieces of the puzzle:
>>> Analyzing Pharo0.1Core-10304cl.changes shows that Stéphane applied
>>> several updates before snapshotting the image. So if Kernel and
>>> System-Support are changed between 10298 and 10304, then we get the
>>> explanation:
>>> - condenseChanges puts everything into the .changes in UTF-8 but
>>> reopens the changes in Latin-1
>>> - further updates up to 10304 write changes in Latin-1
>>> - the image snapshot reopens the changes in UTF-8 and thus we get
>>> invalid UTF-8...
>>>
>>> That's easy to reproduce. Stef, can you confirm?
>>>
>>> That also explains why I did not get the problem at home: I update
>>> early and always save my image afterwards.
>>> After that we still have to detect and clean up the places where
>>> Monticello sources are interpreted as UTF-8 when they should not be
>>> (FIRST TRACK), and eventually make source code go UTF-8 in
>>> Monticello, so that non-Latin programmers can use their favourite
>>> language...
>>>
>>> Nicolas
>>>
>>> 2009/5/23 Stéphane Ducasse:
>>>> No problem, I never interpreted it like that.
>>>> I too want a system that works.
>>>>
>>>> Adrian, I will publish a fix for DNU now
>>>> and I will try later to check the fixes proposed by Yoshiki.
>>>>
>>>> stef
>>>>
>>>> On May 23, 2009, at 1:29 PM, Tudor Girba wrote:
>>>>
>>>>> Actually, the fix is even simpler: if you find a method that raises
>>>>> "invalid utf8 input detected", just browse to it with a class
>>>>> browser and re-accept it :).
>>>>>
>>>>> With my previous mail, I was not implying that someone should fix
>>>>> it for me; I was merely asking what a quick solution could be,
>>>>> because I was a bit lost (scared) :). Now I am happy. Thanks for
>>>>> discussing it.
>>>>>
>>>>> Cheers,
>>>>> Doru
>>>>>
>>>>> On 23 May 2009, at 13:07, Tudor Girba wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I attached here a DNU implementation I took from an older image.
>>>>>> After filing this one in, I can debug DNU problems.
>>>>>>
>>>>>> Cheers,
>>>>>> Doru
>>>>>>
>>>>>> <Object-doesNotUnderstand.st>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 23 May 2009, at 13:04, Stéphane Ducasse wrote:
>>>>>>
>>>>>>> I did the following:
>>>>>>>
>>>>>>> (Object>>#doesNotUnderstand:) getSourceFromFile and I get an
>>>>>>> invalid....
>>>>>>>
>>>>>>> Now when I take another method
>>>>>>>
>>>>>>> (BalloonFontTest>>#testDefaultFont) I do not get a problem.
>>>>>>>
>>>>>>> I will carefully reread Nicolas's mails to try to understand.
>>>>>>> I do not know if the fixes of yoh
>>>>>>>
>>>>>>>   http://bugs.squeak.org/view.php?id=5996
>>>>>>> are related.
>>>>>>>
>>>>>>> Nicolas
>>>>>>>
>>>>>>>>> {Object>>#doesNotUnderstand:.
>>>>>>>>> SystemNavigation>>#browseMethodsWhoseNamesContain:.
>>>>>>>>> Utilities class>>#changeStampPerSe.
>>>>>>>>> Utilities class>>#methodsWithInitials:} collect: [:e |
>>>>>>>>>     (e getSourceFromFile select: [:s | s charCode > 127]) asArray
>>>>>>>>>         collect: [:c | c charCode]]
>>>>>>>
>>>>>>> I cannot get that code to run; it breaks before that for me.
>>>>>>>
>>>>>>> Stef
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pharo-project mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>>>>
>>>>>> --
>>>>>> www.tudorgirba.com
>>>>>>
>>>>>> "Not knowing how to do something is not an argument for how it
>>>>>> cannot be done."
>>>>>
>>>>> --
>>>>> www.tudorgirba.com
>>>>>
>>>>> "Problem solving efficiency grows with the abstractness level of
>>>>> problem understanding."
>>>>>
>>>>
>>>
>>
>
>

