2013/7/12 Lars Buitinck <l.j.buiti...@uva.nl>:
> 2013/7/12 Antonio Manuel Macías Ojeda <antonio.macias.oj...@gmail.com>:
>> I'm not sure how you are using it, but something to take into account
>> is that the default NLTK tokenizer is meant to be used on sentences,
>> not on whole paragraphs or documents, so it should operate on the
>> output of a sentence tokenizer, not on the raw text. Also, it should
>> be fed either pure ASCII or unicode, not encoded strings.
>>
>> Another consideration is that by default it outputs tokens in a Penn
>> Treebank-compatible format, which might be overkill for your use case.
>> NLTK provides simpler/faster tokenizers too, in case you want something
>> more than splitting on whitespace/punctuation but don't want to
>> sacrifice a lot of performance for it.
>
> I know what the tokenizer does; I was in fact feeding it sentences,
> because I was doing clustering at the word level and needed precise
> tokenization. I also submitted a patch that made it twice as fast,
> which was pulled yesterday (https://github.com/nltk/nltk/pull/434).
>
> My general point is that NLTK is not written for speed. It's nice for
> learning, but its algorithms are rarely fast enough to be used online,
> and even in batch settings I tend to use it only for prototyping.
>
>>> On 12 July 2013 09:48, Lars Buitinck <l.j.buiti...@uva.nl> wrote:
>>>> 2013/7/11 Tom Fawcett <tom.fawc...@gmail.com>:
>>>> [...]
>>>>
>>>> I guess because it's terribly slow. I recently tried to cluster a
>>>> sample of Wikipedia text at the word level.
>>>
>>> What kind of results did you get? I did some work recently clustering
>>> short-form text and was generally unimpressed with the results.
>
> Pretty good results, actually. I was clustering these words to get
> extra features for a NER tagger, which immediately got a boost in F1
> score.
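[Editor's note: a rough stand-in for the distinction discussed above — sentence-split first, then tokenize, and use a simpler tokenizer when Treebank output is overkill. The regex below mirrors the `\w+|[^\w\s]+` pattern behind NLTK's `WordPunctTokenizer`; the sentence splitter is a naive period-based placeholder for a real sentence tokenizer such as punkt, not NLTK's actual implementation.]

```python
import re

def naive_sent_tokenize(text):
    # Naive stand-in for a real sentence tokenizer (e.g. NLTK's punkt):
    # split on sentence-final punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def wordpunct_tokenize(sentence):
    # Same pattern NLTK's WordPunctTokenizer uses: runs of word
    # characters, or runs of punctuation, each become a token.
    return re.findall(r"\w+|[^\w\s]+", sentence)

text = "NLTK is nice for learning. It isn't always fast, though!"
for sent in naive_sent_tokenize(text):
    print(wordpunct_tokenize(sent))
```

Note how the punctuation-aware split breaks "isn't" into three tokens; the Treebank tokenizer would instead produce "is"/"n't", which is the extra precision being traded against speed.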
Interesting. Do you run a clustering algorithm for each individual word,
or do you cluster the POS-tag contexts of all the center words at once?
How many clusters do you extract? Have you tried any heuristics to find
the "true" number of clusters, or do you just over-allocate n_clusters
and let the supervised model that uses the cluster-activation features
deal with an overcomplete feature space?

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
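[Editor's note: a minimal sketch of the over-allocation idea the question describes, using scikit-learn's KMeans. All data here is made up — the word list, the context strings, and n_clusters are illustrative; in practice each row of X would be counts of context words (or POS tags) observed around a vocabulary word, and the resulting cluster IDs would be fed as features to a NER tagger.]

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary and one context string per word; real context
# vectors would aggregate many occurrences from a corpus.
words = ["london", "paris", "berlin", "monday", "tuesday", "friday"]
contexts = [
    "in the city of", "flew to the capital", "visited the city",
    "on the morning of", "meeting scheduled for", "deadline is next",
]
X = CountVectorizer().fit_transform(contexts)

# Deliberately over-allocate n_clusters rather than estimate the "true"
# number; the downstream supervised model can down-weight useless
# cluster-ID features on its own.
n_clusters = 4
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
word2cluster = dict(zip(words, labels))
```

Each word's cluster ID (`word2cluster[w]`) can then be one-hot encoded and appended to the tagger's feature vector for tokens of that word.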