2011/7/4 Jörn Kottmann <[email protected]>:
> On 7/4/11 2:05 PM, Olivier Grisel wrote:
>>
>> Done. See my comment on
>> https://issues.apache.org/jira/browse/OPENNLP-211 for additional info
>> on the integration / usage.
>
> Thanks, it doesn't seem that difficult to parse. Hopefully we will quickly
> reach a state where it is possible to import the Wikinews data into the
> corpus server; the parsing might need a little fine-tuning to give good
> results.

Keeping the correct link position from the original markup while cleaning
it can be tricky though. Be careful when tweaking the parser. Maybe the
Span helper classes from OpenNLP could help make this code more robust.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
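
PS: to illustrate the kind of offset bookkeeping I mean, here is a minimal
sketch in Python. It is not the code referenced in OPENNLP-211 and it does
not use OpenNLP: LinkSpan and strip_links are hypothetical names, LinkSpan is
only a stand-in for a span type, and the regex assumes plain [[Target]] and
[[Target|label]] links with no nesting or templates. The idea is to compute
each link's start/end offsets in the cleaned text at the moment the markup is
removed, rather than trying to recover them afterwards.

```python
import re
from typing import List, NamedTuple, Tuple


class LinkSpan(NamedTuple):
    """Hypothetical stand-in for a span type: offsets refer to the cleaned text."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    target: str  # link target taken from the markup


# Assumption: only [[Target]] and [[Target|label]] are handled.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def strip_links(markup: str) -> Tuple[str, List[LinkSpan]]:
    """Remove link markup and return (cleaned_text, spans over cleaned_text)."""
    parts: List[str] = []
    spans: List[LinkSpan] = []
    out_len = 0  # length of the cleaned text emitted so far
    last = 0     # position in the markup just after the previous link

    for match in LINK.finditer(markup):
        # Copy the plain text between the previous link and this one.
        before = markup[last:match.start()]
        parts.append(before)
        out_len += len(before)

        # The visible label is the part after '|', or the target itself.
        target = match.group(1)
        label = match.group(2) or target
        spans.append(LinkSpan(out_len, out_len + len(label), target))
        parts.append(label)
        out_len += len(label)
        last = match.end()

    parts.append(markup[last:])
    return "".join(parts), spans


if __name__ == "__main__":
    text, links = strip_links("The [[Paris|capital]] of [[France]] is big.")
    for link in links:
        # Each span must cover exactly the label text in the cleaned string.
        print(link, repr(text[link.start:link.end]))
```

A cheap sanity check when tweaking the parser is to assert, for every span,
that text[start:end] is still the expected label. Any cleaning step that
runs after the spans are computed would have to shift the offsets in the
same way.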
