2010/1/14 Robin Anil <[email protected]>:
> On the question of analyzer quality. (Assuming speed could be circumvented
> by adding more machines)
>
> Wikipedia data is in wikitext format
>
> so there are many {{Title}} [[Link|LinkText]] some html tags
>
> Should I write my own stream-based analyzer, or will some regex rules to
> filter them do?

In my case I need smart Wikipedia parsing, since I want to extract the
link info (labels and positions, as output text annotations with
character offsets). The JFlex scanner used by Lucene does not
support this and I don't have time to try to extend it, so I have fallen
back to the following MediaWiki parser:
http://code.google.com/p/gwtwiki/wiki/Mediawiki2HTML
It is probably a lot slower than the Lucene analyzer, though I haven't
benchmarked it yet.
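
To make the "label and positions" output concrete, here is a minimal
regex-based sketch in Python. It is not what gwtwiki or corpusmaker do
internally; the names strip_markup and extract_links are made up for
illustration. It assumes templates and links are not nested, which real
wikitext often violates, so treat it as a toy rather than a replacement
for a real parser.

```python
import re

# Assumption: only simple, non-nested constructs are handled.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")   # {{Title}}
TAG_RE = re.compile(r"<[^>]+>")               # <b>, </ref>, ...
# [[Target]] or [[Target|Label]]
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def strip_markup(wikitext):
    """Drop simple templates and HTML tags (hypothetical helper)."""
    return TAG_RE.sub("", TEMPLATE_RE.sub("", wikitext))


def extract_links(wikitext):
    """Return (plain_text, annotations).

    Each annotation is (label, target, start, end), where start and end
    are character offsets into plain_text, not into the raw wikitext.
    """
    source = strip_markup(wikitext)
    pieces = []
    annotations = []
    pos = 0        # read position in source
    out_len = 0    # number of plain-text characters emitted so far
    for m in LINK_RE.finditer(source):
        before = source[pos:m.start()]
        pieces.append(before)
        out_len += len(before)
        target = m.group(1).strip()
        # Fall back to the target when no label is given.
        label = m.group(2) if m.group(2) else target
        annotations.append((label, target, out_len, out_len + len(label)))
        pieces.append(label)
        out_len += len(label)
        pos = m.end()
    pieces.append(source[pos:])
    return "".join(pieces), annotations


if __name__ == "__main__":
    text, links = extract_links(
        "{{Title}}See <b>[[Apache Lucene|Lucene]]</b> and [[Mahout]]."
    )
    print(text)
    for label, target, start, end in links:
        print(label, target, start, end, repr(text[start:end]))
```

Computing the offsets against the output text rather than the raw
wikitext is what lets the annotations be stored alongside the plain
text.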

Here is the current (largely unfinished) state of my Clojure utility
to wrap this up:

  http://github.com/ogrisel/corpusmaker

In particular you might want to have a look at:

  http://github.com/ogrisel/corpusmaker/blob/master/src/corpusmaker/CorpusMakerTextConverter.java

  http://github.com/ogrisel/corpusmaker/blob/master/src/corpusmaker/wikipedia.clj#L105

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
