On Jan 14, 2010, at 6:11 AM, Robin Anil wrote:

> On the question of analyzer quality (assuming speed could be addressed
> by adding more machines):
> 
> Wikipedia data is in wikitext format
> 
> so there are many {{Title}}, [[Link|LinkText]], and some HTML tags

There is a Wikipedia Tokenizer in Lucene already that can deal with those for 
the most part.  It is a derivative of the StandardTokenizer.
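
To make the kind of markup in question concrete, here is a rough, stdlib-only Python sketch of stripping it before tokenizing. This is not the Lucene Wikipedia Tokenizer and does not reproduce its behavior; the function name `rough_wikitext_tokens` and the regexes are hypothetical, and it assumes non-nested templates and simple links only.

```python
import re

# Hypothetical illustration only: a regex approximation, not Lucene's tokenizer.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                    # {{Title}}, non-nested
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")   # [[Link|LinkText]] or [[Link]]
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")                # <b>, </b>, <ref name="x">
WORD = re.compile(r"\w+")


def rough_wikitext_tokens(text):
    """Drop templates and HTML tags, keep the visible link text, then split into words."""
    text = TEMPLATE.sub(" ", text)    # templates carry no running text here
    text = LINK.sub(r"\1", text)      # keep LinkText (or the target if there is no pipe)
    text = HTML_TAG.sub(" ", text)    # remove the tags, keep what they enclose
    return [t.lower() for t in WORD.findall(text)]
```

A regex pass like this breaks on nested templates, tables, and other wikitext constructs, which is one reason to prefer a purpose-built tokenizer over ad hoc cleanup.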
