Indexing Wikipedia/MediaWiki

Wunderlich, Tobias Fri, 16 Sep 2011 00:54:04 -0700

Hey folks,

I am currently working on a project to create a basic search platform using 
Solr and ManifoldCF. One of the content repositories I need to index is a wiki 
(MediaWiki), and that's where I ran into a wall. I tried the web connector, but 
simply crawling the pages pulled in a lot of content I don't need (navigation 
links, ...) and missed some of the information I wanted (author, last 
modified, ...). The only metadata I got was what is included in head/meta, 
which wasn't relevant.


Is there another way to get at the wiki's data and, more importantly, is there 
a way to get the right data into the right field? I know that the wiki pages 
can be exported as XML containing wiki syntax, but I don't know how that would 
help me. I could simply use Solr's DataImportHandler to index a complete wiki 
dump, but it would be nice to use the same framework for every repository, 
especially since ManifoldCF manages all the recrawling.
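
To make "the right data into the right field" a bit more concrete, here is a 
rough sketch of the mapping I am after, written against the XML export rather 
than ManifoldCF. It assumes the usual export layout (page/title and 
revision/timestamp, contributor/username, text). The field names (title, 
author, last_modified, content) and the sample document are made up for 
illustration, and the content field would still contain raw wiki syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of a MediaWiki XML export; the namespace
# version differs between MediaWiki releases, so the code below ignores it.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page>
    <title>Main Page</title>
    <revision>
      <timestamp>2011-09-15T10:00:00Z</timestamp>
      <contributor><username>Tobias</username></contributor>
      <text>'''Hello''' wiki</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child
    return None


def text_of(elem, *path):
    """Follow a chain of child names and return the final element's text."""
    for name in path:
        if elem is None:
            return None
        elem = find_child(elem, name)
    return elem.text if elem is not None else None


def pages_to_docs(xml_string):
    """Map each <page> to a dict of (assumed) index field names."""
    root = ET.fromstring(xml_string)
    docs = []
    for page in root:
        if local(page.tag) != "page":
            continue
        # An export can list several revisions; use the last one listed.
        revisions = [c for c in page if local(c.tag) == "revision"]
        rev = revisions[-1] if revisions else None
        docs.append({
            "title": text_of(page, "title"),
            # None for anonymous edits, which carry no <username>.
            "author": text_of(rev, "contributor", "username"),
            "last_modified": text_of(rev, "timestamp"),
            "content": text_of(rev, "text"),  # still raw wiki syntax
        })
    return docs


for doc in pages_to_docs(SAMPLE):
    print(doc)
```

What I don't see is how to get a mapping like this into a ManifoldCF job, so 
that recrawling keeps working the same way as for the other repositories.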

Does anybody have experience with this, or any idea for a solution?

Thanks in advance,
Tobias
