Re: DictionaryVectorizer meets Wikipedia.

Grant Ingersoll Thu, 14 Jan 2010 03:54:16 -0800

On Jan 14, 2010, at 6:11 AM, Robin Anil wrote:

> On the question of analyzer quality. (Assuming speed could be circumvented
> by adding more machines)
> 
> Wikipedia data is in wikitext format
> 
> so there are many {{Title}} and [[Link|LinkText]] constructs, plus some HTML tags


There is already a Wikipedia tokenizer in Lucene (WikipediaTokenizer) that can
deal with those for the most part.  It is a derivative of the StandardTokenizer.
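
To make the markup in question concrete, here is a minimal Python sketch of a regex pass over the three constructs named above. It is not Lucene's WikipediaTokenizer and does not reproduce its behavior; the helper names (strip_wikitext, tokenize) are hypothetical, and the patterns assume non-nested templates and simple links.

```python
import re

# Hypothetical helpers for illustration only; this is NOT how Lucene's
# WikipediaTokenizer works, just a rough regex pass over the markup
# mentioned in the thread.

# {{Title}} style templates (assumption: no nested templates)
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")

# [[Link|LinkText]] or [[Link]]; keeps the text after the last pipe
LINK = re.compile(r"\[\[(?:[^\[\]]*\|)?([^\[\]|]*)\]\]")

# Simple HTML tags such as <b>, </b>, <br/>
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_wikitext(text):
    """Remove templates and HTML tags, and reduce links to their display text."""
    text = TEMPLATE.sub(" ", text)
    text = LINK.sub(r"\1", text)
    text = HTML_TAG.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


def tokenize(text):
    """Lowercase word tokens from the stripped text."""
    return re.findall(r"\w+", strip_wikitext(text).lower())


if __name__ == "__main__":
    sample = "{{Title}} See [[Apache Lucene|Lucene]] and [[Mahout]] for <b>details</b>."
    print(tokenize(sample))
```

A pass like this loses information a real analyzer may keep, such as link targets or token types, so it is only meant to show what needs handling before vectorization.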
