Hi Benjamin,

On 05 Oct 2014, at 22:37, Benjamin Pollack <benja...@bitquabit.com> wrote:

> On Sun, 05 Oct 2014 10:36:31 -0400, Sven Van Caekenberghe <s...@stfx.eu> 
> wrote:
> 
>> How come you got WideStrings ?
>> What does the input look like, can you give a partial example ?
> 
> I'm guessing I got WideStrings because the file is indeed in UTF-8, with lots 
> of characters outside the lower 128 code points.  A sample couple of lines 
> might look like
> 
>       光田康典,The National,Sarah McLachlan,周杰倫,Indochine,Rise Against,City and 
> Colour,Cæcilie Norby,El Cumbanchero,Death Letter
>       The Beatles,The Who,Barenaked Ladies,The Doors,Bob Dylan
> 
> These are two play lists, one per line; each comma-delimited element is a 
> band on that play list.
> 
> The full line for reading and tokenization is just "pathToFile 
> asFileReference contents lines collect: [ :line | (',' split: line) collect: 
> [ :ea | ea trimmed ]]."  Based on the profile indicating that a lot of time 
> is lost on things like WideString>>copyFrom:to:, I wasn't optimistic about 
> trying to stream the contents instead of just calling "contents lines", but I 
> admit I didn't try.
> 
> --Benjamin

Working with WideStrings is way slower than working with ByteStrings; there is 
just no way around that. What is especially slow is the automagic switch from 
ByteString to WideString, for example inside a String>>#streamContents:, because 
a #becomeForward: is involved. If that happens for every line or every token, 
that would be crazy.
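
To make that switch concrete, here is a minimal sketch (just an illustration, 
assuming a standard Pharo image), using the 光 character from your sample data. 
The stream buffer starts out as a ByteString and gets converted, via the 
#becomeForward: just mentioned, the moment a character outside the byte range 
is written:

	| result |
	result := String streamContents: [ :out |
		out nextPutAll: 'The Beatles,'.	"buffer is still a ByteString here"
		out nextPut: (Character value: 16r5149) ].	"光 (U+5149) forces the switch to WideString"
	result class.	"==> WideString"

Now imagine that conversion happening for every line, or worse for every token, 
and you can see where the time goes.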

Apart from that, the tokenisation is not very efficient: #lines copies your 
whole contents, and so do #split: and #trimmed. The algorithm sounds a bit lazy 
as well; writing it 'on purpose' with an eye for performance might yield 
better results.
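
For example, here is a rough sketch of a hand-written tokeniser (my illustration 
only, untested against your data; tokenize is just a workspace variable). It 
cuts a line into trimmed tokens with a single scan and one #copyFrom:to: per 
token, instead of copying every token twice via #split: followed by #trimmed:

	| tokenize |
	tokenize := [ :line | | tokens start |
		tokens := OrderedCollection new.
		start := 1.
		1 to: line size + 1 do: [ :index |
			(index > line size or: [ (line at: index) = $, ]) ifTrue: [ | from to |
				from := start.
				to := index - 1.
				"trim whitespace by moving the bounds, without an extra copy"
				[ from <= to and: [ (line at: from) isSeparator ] ] whileTrue: [ from := from + 1 ].
				[ to >= from and: [ (line at: to) isSeparator ] ] whileTrue: [ to := to - 1 ].
				tokens add: (line copyFrom: from to: to).
				start := index + 1 ] ].
		tokens ].

	tokenize value: 'The Beatles, The Who , Barenaked Ladies'.
	"==> an OrderedCollection('The Beatles' 'The Who' 'Barenaked Ladies')"

Each copy is still a WideString copy, there is no escaping that, but you copy 
each token once instead of twice and skip the intermediate collection. The same 
idea applies one level up: reading line by line from a read stream over the 
file would avoid the extra copy that #contents followed by #lines makes of the 
whole data.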

But I guess this is not really an exercise in optimisation. If it is, you 
should give us the dataset and the code (and maybe runnable Python code as a 
reference), with some comments.

Sven

