2011/6/22 Jörn Kottmann <[email protected]>:
> On 6/22/11 10:45 AM, Olivier Grisel wrote:
>>
>> I will (soon?) include a couple of new scripts in pignlproc to extract
>> occurrence contexts of any kind of entities occurring as wikilinks in
>> Wikipedia dumps to load those in a Solr index. I will let you know
>> when that happens.
>
> We definitely need some code to parse the wikipedia articles.
> How do you transform the wiki text to plain text in pignlproc?

I use a mediawiki markup parser from gwtwiki: https://code.google.com/p/gwtwiki/

The API is a bit unintuitive to use, but when I searched for a good
mediawiki parser it was one of the best I found that also had a
license compatible with the ASF requirements for dependencies.
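
For reference, plain-text conversion with gwtwiki usually looks roughly
like the sketch below (just an illustration: the base URL templates are
placeholders and the exact package layout may differ between gwtwiki
versions, so double check against the release you use):

  import info.bliki.wiki.filter.PlainTextConverter;
  import info.bliki.wiki.model.WikiModel;

  public class WikiTextExample {
      public static void main(String[] args) {
          // The image and link URL templates required by the WikiModel
          // constructor; the values here are only placeholders.
          WikiModel model = new WikiModel(
                  "https://en.wikipedia.org/wiki/${image}",
                  "https://en.wikipedia.org/wiki/${title}");
          String markup = "'''Apache OpenNLP''' is a [[machine learning]] toolkit.";
          // Render the mediawiki markup to plain text instead of HTML.
          String plainText = model.render(new PlainTextConverter(), markup);
          System.out.println(plainText);
      }
  }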

> Could we take a similar approach for the annotation project, or maybe
> even share the code which does it?

Sure, it is here (again the ITextConverter API imposed by gwtwiki is
not intuitive so focus on the convert / getWikiLinks methods as entry
points when reading the source code):

  
https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
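
Roughly, the intended usage is something like the following (a sketch
only: the exact method signatures and the shape of the objects returned
by getWikiLinks are assumptions on my part, so please read the class
itself rather than trust this snippet):

  import pignlproc.markup.AnnotatingMarkupParser;

  public class ParserSketch {
      public static void main(String[] args) {
          AnnotatingMarkupParser parser = new AnnotatingMarkupParser();
          String markup = "[[Apache Hadoop]] is often used with [[Apache Pig|Pig]].";
          // convert() is expected to strip the mediawiki markup and return
          // the plain text while collecting wikilink targets and offsets.
          String text = parser.convert(markup);
          System.out.println(text);
          // getWikiLinks() exposes the link annotations gathered during convert().
          System.out.println(parser.getWikiLinks());
      }
  }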

I found empirically that it can process about 1 MB/s, so it would take
roughly one day to process an English Wikipedia dump. Hence the use of
Apache Pig / Hadoop and EC2 for this kind of task: with 20 machines it
takes a bit more than 1h to process the same dump in parallel with the
same pig script.

As said previously, I find Spark very, very promising: it might be a
more maintainable integration target than Pig, as it is also better
suited to the interactive and iterative tasks that are typical of
NLP / machine learning work.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
