Connector for crawling Wikis
----------------------------

                 Key: CONNECTORS-256
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-256
             Project: ManifoldCF
          Issue Type: New Feature
    Affects Versions: ManifoldCF 0.4
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 0.4

People have been trying to crawl wikis with ManifoldCF, but using the generic crawler is not a good way to do this. Instead, it looks like we really could use a wiki connector, which would understand the wiki API and thus crawl wiki content quickly and effectively. Some pertinent API references follow.

I don't know if it is possible to link to a wiki document with just the pageid, but it is possible to get the URL for a given pageid via the API:

http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url

It is possible to get the metadata of a document directly by page ID instead of by title:

Title  -> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
PageID -> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content

- There needs to be some notion of an overall list of pages:
  - http://www.mediawiki.org/wiki/API:Allpages
  - Example: http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
- Metadata information (author and pub date) also needs to be separated out in some way:
  - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
  - Example: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
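
As a rough illustration of how the three API calls above fit together, here is a minimal Python sketch that only builds the request URLs and pulls page IDs out of an allpages response. It is not the connector design. The function names are hypothetical, the format=json parameter is an addition not present in the URLs above, and the response shape used in extract_pageids is an assumption based on the JSON that current MediaWiki installations return.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs above.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"
# Revision properties used in the example URLs above.
RVPROP = "timestamp|user|comment|content"


def build_query_url(endpoint, **params):
    """Build an action=query URL. format=json is an assumption added here
    so that a client gets machine-readable output."""
    query = {"action": "query", "format": "json"}
    query.update(params)
    # Keep "|" unescaped so the result resembles the URLs in the issue text.
    return endpoint + "?" + urlencode(query, safe="|")


def page_info_url(pageid, endpoint=API_ENDPOINT):
    """URL that asks for the canonical page URL of a given pageid."""
    return build_query_url(endpoint, prop="info", pageids=pageid, inprop="url")


def revisions_url(pageids, endpoint=API_ENDPOINT):
    """URL that asks for revision metadata and content by page ID."""
    ids = "|".join(str(p) for p in pageids)
    return build_query_url(endpoint, prop="revisions", pageids=ids, rvprop=RVPROP)


def allpages_url(apfrom="", aplimit=5, endpoint=API_ENDPOINT):
    """URL that lists pages starting at the title prefix apfrom."""
    return build_query_url(endpoint, list="allpages", apfrom=apfrom, aplimit=aplimit)


def extract_pageids(response):
    """Pull page IDs out of a decoded allpages response (assumed shape:
    {"query": {"allpages": [{"pageid": ..., "title": ...}, ...]}})."""
    return [page["pageid"] for page in response.get("query", {}).get("allpages", [])]


if __name__ == "__main__":
    print(allpages_url("Kre", 5))
    print(page_info_url(27697087))
    print(revisions_url([27697087]))
```

A crawler built along these lines would presumably seed from allpages, then fetch content and metadata per page ID with the revisions query, and use the info query to obtain the URL to index against.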