2011/6/22 Jörn Kottmann <[email protected]>:
> On 6/22/11 10:45 AM, Olivier Grisel wrote:
>>
>> I will (soon?) include a couple of new scripts in pignlproc to extract
>> occurrence contexts of any kind of entities occurring as wikilinks in
>> Wikipedia dumps to load those in a Solr index. I will let you know
>> when that happens.
>
> We definitely need some code to parse the wikipedia articles.
> How do you transform the wiki text to plain text in pignlproc?
I use a mediawiki markup parser from gwtwiki:

https://code.google.com/p/gwtwiki/

The API is not very intuitive to use, but when I searched for a good
mediawiki parser it was one of the best I found that also had a license
compatible with ASF requirements for dependencies.

> Could we take a similar approach for the annotation project, or maybe
> even share the code which does it?

Sure, it is here (again, the ITextConverter API imposed by gwtwiki is not
intuitive, so focus on the convert / getWikiLinks methods as entry points
when reading the source code):

https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java

I found empirically that it is able to process about 1 MB/s, hence roughly
one day to process an English Wikipedia dump. Hence the use of Apache Pig /
Hadoop and EC2 for this kind of task: with 20 machines it takes a bit more
than 1h to process the same dump in parallel with the same pig script.

As said previously, I find Spark very, very promising, and it might be a
more maintainable integration target than Pig, as it is also more suitable
for interactive and iterative tasks as is the case with NLP / machine
learning stuff.
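To give an idea of what I mean by entry points, here is a rough, untested
sketch of how the class could be driven from plain Java. The constructor
argument and the exact convert / getWikiLinks signatures are from memory,
so treat the class source above as the reference, not this snippet:

    import java.util.List;

    import pignlproc.markup.AnnotatingMarkupParser;

    public class WikiMarkupExample {
        public static void main(String[] args) {
            String rawMarkup = "'''Apache Hadoop''' is a [[software framework]]"
                    + " for [[distributed computing]].";

            // Assumption: one parser instance per language ("en" here).
            AnnotatingMarkupParser parser = new AnnotatingMarkupParser("en");

            // Assumption: convert(...) takes the raw mediawiki markup and
            // returns the plain text rendering, recording link spans as it goes.
            String plainText = parser.convert(rawMarkup);

            // getWikiLinks() then exposes the wikilink annotations (targets plus
            // character offsets in the plain text) collected during conversion;
            // the element type is whatever annotation class pignlproc defines.
            List<?> links = parser.getWikiLinks();

            System.out.println(plainText);
            for (Object link : links) {
                System.out.println(link);
            }
        }
    }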
-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel