Hey folks, I am currently working on a project to build a basic search platform using Solr and ManifoldCF. One of the content repositories I need to index is a wiki (MediaWiki), and that's where I ran into a wall. I tried the web connector, but simply crawling the pages pulled in a lot of content I don't need (navigation links, ...) and missed some of the information I wanted (author, last modified, ...). The only metadata I got was what is included in the head/meta tags, and that wasn't relevant.
Is there another way to get at the wiki's data and, more importantly, is there a way to get the right data into the right field? I know the wiki pages can be exported as XML containing the wiki syntax, but I don't know how that would help me. I could simply use Solr's DataImportHandler to index a complete wiki dump, but it would be nice to use the same framework for every repository, especially since ManifoldCF manages all the recrawling. Does anybody have experience with this, or an idea for a solution?

Thanks in advance,
Tobias
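
P.S. To make the field mapping I am after more concrete, here is a rough sketch of pulling title, author, last modified and text out of an XML export. It is only an illustration: the sample XML is made up in the shape of a MediaWiki export, the output field names (id, title, author, last_modified, text) are my own guesses rather than an existing Solr schema, and the helper names are hypothetical. It says nothing about how ManifoldCF would drive this.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a MediaWiki XML export; the schema version
# in the namespace URI differs between MediaWiki releases.
SAMPLE_EXPORT = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example page</title>
    <id>42</id>
    <revision>
      <timestamp>2012-05-01T10:15:00Z</timestamp>
      <contributor><username>Tobias</username></contributor>
      <text>'''Example''' text in [[wiki syntax]].</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix so the schema version does not matter."""
    return tag.rsplit("}", 1)[-1]


def find_child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child
    return None


def text_of(elem, *path):
    """Follow a path of local tag names and return the text, or None if missing."""
    for name in path:
        if elem is None:
            return None
        elem = find_child(elem, name)
    return elem.text if elem is not None else None


def pages_to_docs(xml_string):
    """Map every <page> to a flat dict (field names are assumptions).

    Only the first <revision> of each page is read.
    """
    root = ET.fromstring(xml_string)
    docs = []
    for page in root.iter():
        if local(page.tag) != "page":
            continue
        docs.append({
            "id": text_of(page, "id"),
            "title": text_of(page, "title"),
            "author": text_of(page, "revision", "contributor", "username"),
            "last_modified": text_of(page, "revision", "timestamp"),
            "text": text_of(page, "revision", "text"),
        })
    return docs


docs = pages_to_docs(SAMPLE_EXPORT)
```

Note that the text field would still contain raw wiki markup, so it would need cleaning before it is useful for search.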
