[
https://issues.apache.org/jira/browse/CONNECTORS-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120623#comment-13120623
]
Karl Wright commented on CONNECTORS-256:
----------------------------------------
For the "allpages" query, aplimit can be set no higher than 500. This means that
the enumeration must occur in chunks of at most 500 titles, and we have to keep
track of the last title returned so it can be used as apfrom in the next query.
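
The chunked enumeration described above could be sketched roughly as follows.
This is an illustration in Python, not the connector's actual code. The
endpoint is the one used in the examples in this issue; `fetch` is a
hypothetical callable that takes a URL and returns the page titles from the
response in order; and the sketch assumes apfrom is inclusive, so the first
title of a follow-up chunk may repeat the last title already seen.

```python
from urllib.parse import urlencode

# Endpoint taken from the examples in this issue.
API_URL = "http://en.wikipedia.org/w/api.php"
# Upper bound on aplimit mentioned in the comment above.
CHUNK = 500


def allpages_url(apfrom=None, limit=CHUNK):
    """Build one allpages query, optionally resuming from a title."""
    params = {"action": "query", "list": "allpages",
              "aplimit": limit, "format": "json"}
    if apfrom is not None:
        params["apfrom"] = apfrom
    return API_URL + "?" + urlencode(params)


def enumerate_titles(fetch, limit=CHUNK):
    """Yield all titles chunk by chunk.

    fetch(url) is a hypothetical helper returning the list of titles
    found in the response for that URL. limit must be at least 2,
    otherwise an inclusive apfrom can never make progress.
    """
    apfrom = None
    while True:
        titles = fetch(allpages_url(apfrom, limit))
        # Assumption: apfrom is inclusive, so drop the repeated title.
        if apfrom is not None and titles and titles[0] == apfrom:
            titles = titles[1:]
        if not titles:
            return
        yield from titles
        # Remember the last title before issuing the next query.
        apfrom = titles[-1]
```

Passing `fetch` in as a parameter keeps the paging logic separate from HTTP
and response parsing, so it can be exercised without touching a live wiki.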
> Connector for crawling Wikis
> ----------------------------
>
> Key: CONNECTORS-256
> URL: https://issues.apache.org/jira/browse/CONNECTORS-256
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Wiki connector
> Affects Versions: ManifoldCF 0.4
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 0.4
>
>
> People have been trying to crawl wikis with ManifoldCF, but using the generic
> crawler is not a good way to do this. Instead, it looks like we really could
> use a wiki connector, which would understand the wiki API and thus crawl wiki
> content quickly and effectively.
> Some pertinent API references follow:
> I don't know if it is possible to link to a wiki document with just the
> pageid, but it is possible to get the URL for a given pageid via the API:
> http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
> It is possible to get the metadata of a document using the page ID (instead
> of the title) directly:
> Title ->
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
> PageID ->
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
> - There needs to be some notion of an overall list of pages:
>   - http://www.mediawiki.org/wiki/API:Allpages
>   - Example:
>     http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
> - Metadata information (author and pub date) also needs to be separated out
>   in some way:
>   - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>   - Example:
>     http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
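
The query URLs quoted in the description above all follow one pattern, which
could be captured with a couple of small helpers. This is only a sketch in
Python under stated assumptions: the function names `page_url_query` and
`revisions_query` are made up for illustration, and the endpoint and
parameter values are copied from the examples in the description.

```python
from urllib.parse import urlencode

# Endpoint and rvprop value copied from the examples in the description.
API_URL = "http://en.wikipedia.org/w/api.php"
RVPROP = "timestamp|user|comment|content"


def page_url_query(pageid):
    """Query that asks the API for the URL of the page with this pageid."""
    params = {"action": "query", "prop": "info",
              "pageids": pageid, "inprop": "url"}
    return API_URL + "?" + urlencode(params)


def revisions_query(titles=None, pageids=None):
    """Query for revision metadata and content, by titles or by pageids.

    Exactly one of titles / pageids must be given; multiple values are
    joined with '|' as in the examples.
    """
    if (titles is None) == (pageids is None):
        raise ValueError("pass exactly one of titles or pageids")
    if titles is not None:
        key, values = "titles", titles
    else:
        key, values = "pageids", pageids
    params = {"action": "query", "prop": "revisions",
              key: "|".join(str(v) for v in values),
              "rvprop": RVPROP}
    # Keep '|' unescaped so the result matches the URLs quoted above.
    return API_URL + "?" + urlencode(params, safe="|")
```

For example, `revisions_query(pageids=[27697087])` builds the PageID query
shown in the description, and `page_url_query(27697087)` builds the
`prop=info` query.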