Connector for crawling Wikis
----------------------------

                 Key: CONNECTORS-256
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-256
             Project: ManifoldCF
          Issue Type: New Feature
    Affects Versions: ManifoldCF 0.4
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 0.4


People have been trying to crawl wikis with ManifoldCF, but using the generic 
crawler is not a good way to do this.  Instead, it looks like we really could 
use a wiki connector, which would understand the wiki API and thus crawl wiki 
content quickly and effectively.

Some pertinent API references follow:

I don't know whether it is possible to link to a wiki document with just the 
pageid, but it is possible to get the URL for a given pageid via the API:
http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
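As a rough illustration of the pageid-to-URL lookup above, here is a minimal Python sketch. The helper names (page_info_url, full_url_from_response) are hypothetical, format=json is added so the reply is machine-readable, and the sketch assumes the JSON reply keys pages by pageid and carries the address in a "fullurl" field. It makes no network call; the response is passed in as an already-parsed dict.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint taken from the example URL in this issue.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def page_info_url(page_id: int, endpoint: str = API_ENDPOINT) -> str:
    """Build the prop=info query for one pageid (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "info",
        "pageids": str(page_id),
        "inprop": "url",
        "format": "json",  # assumption: JSON output is wanted
    }
    return endpoint + "?" + urlencode(params)


def full_url_from_response(response: dict, page_id: int) -> Optional[str]:
    """Return the page URL from a parsed reply, or None if it is absent.

    Assumes the reply is shaped like
    {"query": {"pages": {"<pageid>": {"fullurl": ...}}}}.
    """
    page = response.get("query", {}).get("pages", {}).get(str(page_id), {})
    return page.get("fullurl")
```

A connector could store only the pageid as the document identifier and resolve the display URL this way at indexing time.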

It is possible to get the metadata of a document directly using the page's ID 
(instead of its title):
Title -> 
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
PageID -> 
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
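The two request forms above differ only in whether "titles" or "pageids" is supplied, which can be sketched as one helper. This is an illustration only: revisions_url is a hypothetical name, and format=json is an added assumption. Note that urlencode percent-encodes the "|" separator as %7C, which is the same request on the wire.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def revisions_url(
    endpoint: str,
    *,
    titles: Optional[Iterable[str]] = None,
    page_ids: Optional[Iterable[int]] = None,
    props: Iterable[str] = ("timestamp", "user", "comment", "content"),
) -> str:
    """Build a prop=revisions query addressed by title OR by pageid."""
    if (titles is None) == (page_ids is None):
        raise ValueError("pass exactly one of titles or page_ids")
    params = {"action": "query", "prop": "revisions"}
    if titles is not None:
        params["titles"] = "|".join(titles)
    else:
        params["pageids"] = "|".join(str(p) for p in page_ids)
    params["rvprop"] = "|".join(props)
    params["format"] = "json"  # assumption: JSON output is wanted
    return endpoint + "?" + urlencode(params)
```

For example, revisions_url("http://en.wikipedia.org/w/api.php", page_ids=[27697087]) corresponds to the PageID request shown above.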



- There needs to be some notion of an overall list of pages:
       - http://www.mediawiki.org/wiki/API:Allpages
       - Example: 
http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5

- Metadata information (author and pub date) also needs to be separated out in 
some way:
       - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
       - Example:  
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
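The two requirements above (an overall page list, and author/date metadata separated out) could be sketched roughly as follows. This is not the connector's design, only an illustration in Python. The names iter_all_pages, extract_metadata and the injected fetch callable are hypothetical. The sketch assumes the reply signals further results with a top-level "continue" object whose entries are merged into the next request (older MediaWiki versions use "query-continue" instead), and that revision text sits under the "*" key of the JSON reply.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterator, List


def iter_all_pages(
    fetch: Callable[[Dict[str, str]], dict],
    start: str = "",
    limit: int = 5,
) -> Iterator[dict]:
    """Yield every entry of list=allpages, following continuation.

    `fetch` is a caller-supplied function that sends the parameters to
    api.php and returns the parsed JSON reply.
    """
    params = {"action": "query", "list": "allpages",
              "aplimit": str(limit), "format": "json"}
    if start:
        params["apfrom"] = start
    while True:
        response = fetch(dict(params))  # pass a copy; params is reused
        for page in response.get("query", {}).get("allpages", []):
            yield page
        cont = response.get("continue")  # assumed continuation format
        if not cont:
            break
        params.update(cont)


def extract_metadata(response: dict) -> List[Dict[str, object]]:
    """Separate author and modification date from a prop=revisions reply."""
    records = []
    for page in response.get("query", {}).get("pages", {}).values():
        revisions = page.get("revisions") or []
        if not revisions:
            continue  # e.g. a missing page has no revisions
        rev = revisions[0]
        modified = datetime.strptime(
            rev["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        records.append({
            "pageid": page.get("pageid"),
            "title": page.get("title"),
            "author": rev.get("user"),
            "modified": modified,
            "comment": rev.get("comment", ""),
            "content": rev.get("*", ""),  # assumed key for revision text
        })
    return records
```

Injecting fetch keeps the paging logic independent of the HTTP client, so it can be exercised with canned replies before being pointed at a real wiki.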




        
