Hi Benjamin,

On 05 Oct 2014, at 22:37, Benjamin Pollack <benja...@bitquabit.com> wrote:
> On Sun, 05 Oct 2014 10:36:31 -0400, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>
>> How come you got WideStrings ?
>> What does the input look like, can you give a partial example ?
>
> I'm guessing I got WideStrings because the file is indeed in UTF-8, with lots
> of characters outside the lower 128 code points. A sample couple of lines
> might look like
>
> 光田康典,The National,Sarah McLachlan,周杰倫,Indochine,Rise Against,City and
> Colour,Cæcilie Norby,El Cumbanchero,Death Letter
> The Beatles,The Who,Barenaked Ladies,The Doors,Bob Dylan
>
> These are two playlists, one per line; each comma-delimited element is a
> band on that playlist.
>
> The full line for reading and tokenization is just "pathToFile
> asFileReference contents lines collect: [ :line | (',' split: line) collect:
> [ :ea | ea trimmed ]]." Based on the profile indicating that a lot of time
> is lost on things like WideString>>copyFrom:to:, I wasn't optimistic about
> trying to stream the contents instead of just calling "contents lines", but I
> admit I didn't try.
>
> --Benjamin

Working with WideStrings is much slower than working with ByteStrings; there is
just no way around that. What is especially slow is the automagic switch from
ByteString to WideString, for example inside a String>>#streamContents:, because
a #becomeForward: is involved. If that happens for every line or every token,
that would be crazy.

Apart from that, the tokenisation is not very efficient: #lines makes a copy of
your whole contents, and so do #split: and #trimmed. The algorithm sounds a bit
lazy as well; writing it 'on purpose', with an eye for performance, might yield
better results.

But I guess this is not really an exercise in optimisation. If it is, you
should give us the dataset and code (and maybe runnable Python code as a
reference), with some comments.

Sven
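P.S. Since a runnable Python reference came up: a minimal sketch of the same
tokenization might look like the snippet below. It streams the file line by
line instead of materialising the whole contents and then copying it again for
every line, split, and trim, which is the inefficiency described above. The
file path and function name are placeholders, not anything from Benjamin's code.

```python
def read_playlists(path):
    """Read one playlist per line; each line is a comma-separated
    list of band names, with surrounding whitespace trimmed."""
    playlists = []
    # Stream the file line by line rather than loading all of it
    # into one big string first (the 'contents lines' approach).
    with open(path, encoding='utf-8') as f:
        for line in f:
            playlists.append([token.strip() for token in line.split(',')])
    return playlists
```

Note that Python strings are always Unicode, so there is no ByteString-to-WideString
switch to worry about here; that cost is specific to the Smalltalk side.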