2010/1/14 Robin Anil <[email protected]>:
> On the question of analyzer quality. (Assuming speed could be circumvented
> by adding more machines)
>
> Wikipedia data is in wikitext format
>
> so there are many {{Title}} [[Link|LinkText]] some html tags
>
> Should I be writing my own stream based analyzer maybe some regex rules to
> filter them will do?

In my case I need smart Wikipedia parsing since I want to extract the link info (labels and positions, as annotations on the output text with offsets in characters). As the JFlex scanner used by Lucene does not support this and I don't have time to try to extend it, I have fallen back to the following MediaWiki parser:

  http://code.google.com/p/gwtwiki/wiki/Mediawiki2HTML

It is probably a lot slower than the Lucene analyzer, though I haven't tried to benchmark it yet.

Here is the current (largely unfinished) state of my Clojure utility to wrap this up:

  http://github.com/ogrisel/corpusmaker

In particular you might want to have a look at:

  http://github.com/ogrisel/corpusmaker/blob/master/src/corpusmaker/CorpusMakerTextConverter.java
  http://github.com/ogrisel/corpusmaker/blob/master/src/corpusmaker/wikipedia.clj#L105

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
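
To make "link labels with character offsets" concrete, here is a minimal Python sketch of that kind of output. It is not the gwtwiki-based approach used in corpusmaker: it is a plain regex pass that only handles simple, non-nested [[Target]] and [[Target|Label]] links and ignores templates and HTML tags. The names `LINK_RE` and `extract_links` are hypothetical, chosen for illustration only.

```python
import re

# Matches [[Target]] or [[Target|Label]].
# Assumption: links are not nested; templates and HTML tags are left untouched.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Return (plain_text, annotations) for simple wiki links.

    Each annotation is a (start, end, target, label) tuple where start and end
    are character offsets into plain_text, not into the original wikitext.
    """
    parts = []        # chunks of the output text
    annotations = []  # link annotations, in document order
    out_len = 0       # length of the output text built so far
    pos = 0           # current read position in the input
    for m in LINK_RE.finditer(wikitext):
        # Copy the text between the previous link and this one unchanged.
        before = wikitext[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        target = m.group(1).strip()
        # Without an explicit (non-empty) label, the target is the visible text.
        label = m.group(2) if m.group(2) else target
        annotations.append((out_len, out_len + len(label), target, label))
        parts.append(label)
        out_len += len(label)
        pos = m.end()
    parts.append(wikitext[pos:])
    return "".join(parts), annotations
```

For the input `"See [[Paris|the capital]] of [[France]]."` this yields the text `"See the capital of France."` with annotations `(4, 15, "Paris", "the capital")` and `(19, 25, "France", "France")`. Real Wikipedia dumps contain nested templates, image links and tables, which is why a full parser is the safer choice there.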
