On Jan 14, 2010, at 6:11 AM, Robin Anil wrote:
> On the question of analyzer quality. (Assuming speed could be circumvented
> by adding more machines)
>
> Wikipedia data is in wikitext format
>
> so there are many {{Title}} [[Link|LinkText]] some html tags

There is a Wikipedia Tokenizer in Lucene already that can deal with those for the most part. It is a derivative of the StandardTokenizer.
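
For anyone unfamiliar with the markup in question, here is a rough, standalone Python sketch of the three cases mentioned above (templates, piped links, HTML tags). It is only an illustration: the function name `strip_wikitext` and the regexes are my own assumptions, it does not handle nested templates, and it is not how Lucene's Wikipedia Tokenizer is implemented.

```python
import re

def strip_wikitext(text):
    """Hypothetical helper: crude wikitext cleanup followed by word splitting."""
    # Drop {{Template}} markup (non-nested templates only).
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    # [[Link|LinkText]] -> LinkText, and [[Link]] -> Link.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Drop HTML tags such as <b> or </ref>, keeping the text between them.
    text = re.sub(r"<[^>]+>", " ", text)
    # Lowercase and split on word characters.
    return re.findall(r"\w+", text.lower())

sample = "{{Title}} See [[Apache Lucene|Lucene]] for <b>search</b>."
tokens = strip_wikitext(sample)
```

A regex pass like this throws the markup away entirely, so for real indexing a tokenizer that understands the markup is the better choice.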
