Re: [Pharo-project] invalid utf8 input detected

Stéphane Ducasse Sat, 23 May 2009 10:52:51 -0700

Hi Nicolas,

I was reading Yoshiki's changes, which I will integrate, but indeed they
do not address our case.
My reply is below... I tried to follow :)


> What happened exactly is very hard to trace because these FileStream
> are a can of worms...
> Here are some of my peregrinations:
>
> FIRST POSSIBLE TRACK:
>
> All methods were changed in 10305.
> Monticello snapshot/source.st is not UTF-8.
> If the file is opened as UTF-8, then we get decompiled code; I don't
> know why yet...
> But the changes still go into the change log in correct UTF-8 form, so
> that's just another bug, not the real source of the problem.
> For getting some worms out of the can just browse inst var defs of
> converter in MultiByteFileStream:
> The accessor #converter initializes converter with TextConverter
> defaultSystemConverter, which depends on LanguageEnvironment.
> That is a Latin1TextConverter in my latin image.
> Unless #reset is called first, in which case it will initialize with a
> UTF8TextConverter.
> Yes, but open: fileName forWrite: writeMode does the job too, with a
> UTF8TextConverter.
> Do you still follow? Me neither.
> A better-behaved method is #setConverterForCode, which should let non-UTF-8
> .mcz files work in a UTF-8 environment, but I am not sure it is called
> where required...
> I think Yoshiki's changes are necessary only for writing source code
> with character codes > 255.
> This was not the case for the incriminated methods.
>
> SECOND POSSIBLE TRACK:
>
> Everything going to the change log passes through the MultiByteFileStream,
> so how did non-UTF-8 characters get in?
> I tried to follow two other clues:
> 1) There are senders of #primWrite:from:startingAt:count: not
> redefined in MultiByteFileStream...
>  for example, using #next:putAll:startingAt: will bypass the  
> converter.
> 2) using nextPutAll: with a ByteArray argument also bypasses the
> converter (see MultiByteFileStream>>#nextPutAll:)
> I did not find the senders (do you really believe senders of nextPutAll:
> can be analyzed?).
> I tried to instrument the code with Notification, but I'm unable to
> reproduce the problem, so that was in vain...
>
> THIRD POSSIBLE TRACK:
>
> http://gforge.inria.fr/frs/download.php/22283/Pharo0.1Core-10304cl.zip
> has the invalid UTF-8 problem, just before 10305 changes that
> introduced decompiled code...
> So we might attack the problem with another code snippet:
>
> (SystemNavigation default browseAllCallsOn: (Smalltalk associationAt:
> #SourceFiles))...
>
> Hmm, I might have a better clue now.
> The problem might possibly come from the condenseChanges in  
> update10298.
> What happens in a condenseChanges?
> Changes are copied to this file:
>
> f := FileStream fileNamed: 'ST80.temp'.
>
> So far, so good, because the concreteStream is a MultiByteFileStream.
>
> But it ends with:
>
>       SourceFiles
>               at: 2
>               put: (StandardFileStream oldFileNamed: oldChanges name)
>
> Wow, no MultiByteFileStream here, so no more UTF-8.
> But hey, that would be the inverse problem: reading UTF-8 text with
> a latin1 reader. I can't get an error doing this, only some strange
> sequences of characters (the raw UTF-8 encoding)...
> Unless the incriminated methods are further changed in #script376 or any
> other method, in which case they are written in latin1 in the
> change log...
> Hmm... That could possibly be the case. We must restart the update
> process from
> http://gforge.inria.fr/frs/download.php/22167/Pharo0.1Core-10296cl-2.zip
>
> One thing is sure: at the next returnFromSnapshot, FileDirectory
> class>>startup will reopen the changes file as UTF-8.
> So saving the image will reopen it as UTF-8...
>
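A quick way to check which stream currently backs the changes file is sketched below. It only uses names that appear in this thread (SourceFiles, #converter) plus #respondsTo:. Note that, as described earlier in the thread, sending #converter may lazily initialize the converter, so treat this as a rough diagnostic and not as a side-effect-free probe.

```smalltalk
"Diagnostic sketch: inspect the stream behind the .changes file.
 Evaluate with 'print it' in a workspace."
| changes |
changes := SourceFiles at: 2.
{changes class name.
 (changes respondsTo: #converter)
	ifTrue: [changes converter class name]
	ifFalse: ['no converter']}
```

If the first element names StandardFileStream instead of MultiByteFileStream, the image would be in the state this scenario assumes.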
> But wait... Maybe we get enough pieces of the puzzle:
> Analyzing the Pharo0.1Core-10304cl.changes shows that Stephane applied
> several updates before snapshotting the image. So if Kernel and
> System-Support were changed between 10298 and 10304, then we get the
> explanation:
> - condenseChanges puts everything in the .changes file in UTF-8 but
> reopens the changes in latin1
> - further updates up to 10304 write changes in latin1
> - the image snapshot reopens the changes in UTF-8, and thus we get
> invalid UTF-8 from then on...
>
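The three steps can be mimicked on an in-memory string, without touching any file. This is only a sketch under assumptions: it assumes String>>convertToWithConverter: and String>>convertFromWithConverter: are available in the image, and that the UTF-8 converter signals an Error on malformed input. The variable names are made up for illustration.

```smalltalk
"Sketch: UTF-8 content followed by Latin-1 content, then read back as UTF-8."
| eAcute asUtf8 asLatin1 mixed readBack |
eAcute := String with: (Character value: 233).	"e with acute accent"

"Step 1: what condenseChanges would have written (UTF-8)."
asUtf8 := eAcute convertToWithConverter: UTF8TextConverter new.

"Step 2: what later updates would have appended (Latin-1)."
asLatin1 := eAcute convertToWithConverter: Latin1TextConverter new.
mixed := asUtf8 , asLatin1.

"Step 3: after the snapshot, everything is decoded as UTF-8.
 Catch the error, if any, and keep its message text."
readBack := [mixed convertFromWithConverter: UTF8TextConverter new]
	on: Error
	do: [:ex | ex return: ex messageText].

{asUtf8 asByteArray. asLatin1 asByteArray. readBack}
```

In standard UTF-8 the first array holds two bytes (195 169) and the second a single byte (233). That lone byte announces a multi-byte sequence that never arrives, which is the kind of input a strict UTF-8 decoder rejects.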
> That's easy to reproduce. Stef, can you confirm?

How do you want me to confirm?
By redoing the image? What we can do is change the update method to
block the update at a certain number.

> That also explains why I did not get the problem at home: I update
> early and always save my image afterwards.
> After that we still have to detect and clean up the cases where Monticello
> sources are interpreted as UTF-8 when they should not be (FIRST TRACK), and
> eventually make source code go UTF-8 in Monticello, so that non-latin
> programmers can use their favourite language...
>
> Nicolas
>
> 2009/5/23 Stéphane Ducasse <[email protected]>:
>> No problem, I never interpreted it like that.
>> I too want a system that works.
>>
>> Adrian, I will publish a fix for DNU now,
>> and I will try later to check the fixes proposed by Yoshiki.
>>
>> stef
>>
>> On May 23, 2009, at 1:29 PM, Tudor Girba wrote:
>>
>>> Actually, the fix is even simpler: if you find a method that raises
>>> "invalid utf8 input detected", just browse to it with a class browser,
>>> and re-accept it :).
>>>
>>> With my previous mail, I was not implying that someone should fix it
>>> for me; I was merely asking what a quick solution could be,
>>> because I was a bit lost (scared) :). Now, I am happy. Thanks for
>>> discussing it.
>>>
>>> Cheers,
>>> Doru
>>>
>>> On 23 May 2009, at 13:07, Tudor Girba wrote:
>>>
>>>> Hi,
>>>>
>>>> I attached here a DNU implementation I took from an older image.
>>>> After filing this one in, I can debug DNU problems.
>>>>
>>>> Cheers,
>>>> Doru
>>>>
>>>> <Object-doesNotUnderstand.st>
>>>>
>>>>
>>>>
>>>> On 23 May 2009, at 13:04, Stéphane Ducasse wrote:
>>>>
>>>>> I did the following:
>>>>>
>>>>> (Object>>#doesNotUnderstand:) getSourceFromFile and I get an
>>>>> invalid....
>>>>>
>>>>> Now when I take another method
>>>>>
>>>>> (BalloonFontTest>>#testDefaultFont) I do not get the problem.
>>>>>
>>>>> I will carefully reread Nicolas's mails to try to understand.
>>>>> I do not know whether the fixes of yoh
>>>>>
>>>>>    http://bugs.squeak.org/view.php?id=5996
>>>>> are related.
>>>>>
>>>>> Nicolas
>>>>>
>>>>>>> {Object>>#doesNotUnderstand:.
>>>>>>> SystemNavigation>>#browseMethodsWhoseNamesContain:.
>>>>>>> Utilities class>>#changeStampPerSe.
>>>>>>> Utilities class>>#methodsWithInitials:} collect: [:e | (e
>>>>>>> getSourceFromFile select: [:s | s charCode > 127]) asArray
>>>>>>> collect:
>>>>>>> [:c | c charCode]]
>>>>>
>>>>> I cannot get that code to run; it breaks earlier for me.
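When the snippet breaks before producing anything, a more defensive variant can at least report which of the suspect methods fail to load their source. This is a sketch, assuming the failure surfaces as an Error raised from getSourceFromFile; the variable name suspects is made up.

```smalltalk
"Sketch: answer the suspect methods whose source cannot be read from file."
| suspects |
suspects := {Object>>#doesNotUnderstand:.
	SystemNavigation>>#browseMethodsWhoseNamesContain:.
	Utilities class>>#changeStampPerSe.
	Utilities class>>#methodsWithInitials:}.
suspects select: [:m |
	[m getSourceFromFile. false]
		on: Error
		do: [:ex | ex return: true]]
```

Methods that come back in the result are candidates for the re-accept workaround mentioned elsewhere in this thread.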
>>>>>
>>>>> Stef
>>>>>
>>>>> _______________________________________________
>>>>> Pharo-project mailing list
>>>>> [email protected]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>>
>>>> --
>>>> www.tudorgirba.com
>>>>
>>>> "Not knowing how to do something is not an argument for how it
>>>> cannot be done."
>>>>
>>>
>>> --
>>> www.tudorgirba.com
>>>
>>> "Problem solving efficiency grows with the abstractness level of
>>> problem understanding."
>>>

