2015-03-17 21:10 GMT+01:00 Nicolas Cellier <[email protected]>:

> We can already have relatively fast UTF8 reading when the file is
> essentially composed of ascii.
> I've demonstrated this a few years ago with SqueaXtreams (
> http://www.squeaksource.com/XTream/)
>
> For example, see
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2009-December/141577.html
>
> I don't want to sell SqueaXtreams; it was just an experiment, but the good
> ideas should be extracted, and we don't have to wait for a new VM.
>
> Nicolas Cellier (eh, I now have to disambiguate, maybe I will sign NiCe...)
>
>
I've also found this:
http://lists.squeakfoundation.org/pipermail/squeak-dev/2010-September/153389.html
Only the end of that message is relevant; it is quoted below, with a small
sketch of the idea after the quote:


But wait, the file was buffered (bytes are fetched from the file in
packets), but the decoder was not: all decoding is performed char by char.
That's bad, because when only a few bytes require decoding and the
majority can be translated unchanged to String, there is potentially a
major speed-up from simply using a sub-array copy primitive. We have known
this since #squeakToUTF8, many thanks to Andreas.
To profit from buffering in the decoder too, just wrap it with a message:

    {
    [| tmp |
        tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
                ascii;
                wantsLineEndConversion: false;
                converter: UTF8TextConverter new.
        1 to: 20000 do: [:i | tmp upTo: Character cr].
        tmp close] timeToRun.
    [| tmp |
        tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
                readXtream binary buffered ~> UTF8Decoder) buffered.
        1 to: 20000 do: [:i | tmp upTo: Character cr].
        tmp close] timeToRun.
    }
   #(152 18)
   #(152 19)

Bingo, now the speed-up is there too! Roughly 8x (152 ms vs. 18-19 ms) is
not a bad score after all.
That's not amazing: the change log is essentially made of ASCII and rarely
requires any UTF8 translation at all.
Of course, if you handle files full of Chinese code points, don't expect
any speed-up!
But for a decent proportion of Latin-character users, the potential
speed-up is there, right under our Streams.
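
Here is the small sketch I mentioned above. It is not the XTream code, just
the idea Nicolas describes, written as a workspace snippet: runs of plain
ASCII are copied to the output in bulk, and only multi-byte sequences go
through per-character decoding. Validation, error handling and buffer
boundaries are omitted, and the sample ByteArray is made up for illustration.

    | decode |
    decode := [:bytes |
        String streamContents: [:out |
            | i size runStart b code extra |
            i := 1.
            size := bytes size.
            [i <= size] whileTrue: [
                runStart := i.
                "Scan ahead over the run of plain ASCII bytes."
                [i <= size and: [(bytes at: i) < 128]] whileTrue: [i := i + 1].
                i > runStart ifTrue: [
                    "Fast path: copy the whole ASCII run at once."
                    out nextPutAll: (bytes copyFrom: runStart to: i - 1) asString].
                i <= size ifTrue: [
                    "Slow path: decode one multi-byte sequence."
                    b := bytes at: i.
                    extra := b < 224 ifTrue: [1] ifFalse: [b < 240 ifTrue: [2] ifFalse: [3]].
                    code := b bitAnd: (#(16r1F 16r0F 16r07) at: extra).
                    extra timesRepeat: [
                        i := i + 1.
                        code := (code bitShift: 6) bitOr: ((bytes at: i) bitAnd: 16r3F)].
                    out nextPut: (Character value: code).
                    i := i + 1]]]].
    decode value: #[72 101 108 108 111 32 195 169 116 195 169]  "'Hello été'"

A real decoder would of course also cope with incomplete sequences at the
end of a buffer; the point is only that ASCII runs never take the
per-character path.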



>
> 2015-03-17 19:58 GMT+01:00 [email protected] <[email protected]>:
>
>>
>> On 17 March 2015 at 18:59, "Stephan Eggermont" <[email protected]> wrote:
>> >
>> > On 17/03/15 17:59, Stephan Eggermont wrote:
>> >
>> >> I tried the code with the latest pharo-spur image and vm:
>> >> from 17 seconds down to 10.5
>> >
>> >
>> > And I tried it again with cogspurlinux 3268 and the
>> > trunk46-spur. That needed switching to
>> >
>> > MultiByteFileStream readOnlyFileNamed:
>> >
>> > and ran in 8.8 sec (average of three runs). Interestingly, on Pharo
>> > that is significantly slower, about 15 sec.
>> >
>> > Replacing that with StandardFileStream (and no decoding) reduced it to
>> > <120 ms.
>>
>> Maybe using FFI to read the file in the correct format would be a nice
>> option to have available.
>>
>> The reading code in MultiByteFileStream looks quite convoluted.
>>
>> There is no reason we should need a 64 bit VM to read a UTF8 file fast
>> (8 secs really does not qualify).
>>
>> Opinions?
>>
>> Phil
>>
>> >
>> > Stephan
>> >
>> >
>> > [1 to: 10 do: [:j | | a length |
>> >     length := 0.
>> >     a := (MultiByteFileStream readOnlyFileNamed:
>> >         '/home/stephan/Downloads/jfreechart.mse') readStream contents]]
>> >             timeToRunWithoutGC
>> >
>> >
>> >
>> >
>>
>
>
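
For reference, the StandardFileStream variant Stephan mentions (same snippet,
but no decoding) would look roughly like this; the numbers will of course
depend on your machine:

    [1 to: 10 do: [:j | | a |
        a := (StandardFileStream readOnlyFileNamed:
            '/home/stephan/Downloads/jfreechart.mse') readStream contents]]
                timeToRunWithoutGC

That is the <120 ms baseline against which the decoding variants above are
being compared.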
