It's easy to register with Jira and I encourage you to do so. Nevertheless, I created the ticket: CONNECTORS-256.

Karl

On Mon, Sep 19, 2011 at 8:21 AM, Wunderlich, Tobias
<tobias.wunderl...@igd-r.fraunhofer.de> wrote:
> I don't know if it is possible to link to a wiki document with just the
> page ID, but it is possible to get the URL for a given page ID via the API:
> http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
>
> A new connector for crawling wikis sounds great. Could you create a new
> ticket? I'm not registered at Jira yet ...
>
> Tobias
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Monday, September 19, 2011 12:39
> To: connectors-user@incubator.apache.org
> Subject: Re: Indexing Wikipedia/MediaWiki
>
> The only thing that concerns me about using a document's title as its
> document identifier in ManifoldCF is the possibility of it being renamed.
> For that reason the Page ID is preferable. But it doesn't sound like bad
> things would happen either way.
>
> I'd like to suggest creating a JIRA ticket to describe a new connector for
> crawling wikis. Then I may create a branch in which to work on this. We're
> coming into conference season so it may be some weeks before there's a
> connector to try, though.
>
> Karl
>
> On Mon, Sep 19, 2011 at 6:07 AM, Wunderlich, Tobias
> <tobias.wunderl...@igd-r.fraunhofer.de> wrote:
>> (1) How do you form a URL that would take a user to a document? Does it
>> use the title, or does it use the page ID?
>> I guess one way would be to just add the title to the main URL, like
>> http://en.wikipedia.org/wiki/<title>. However, I have not yet found out
>> how to create a URL to the document via page ID.
>>
>>
>> (2) If the URL includes the page ID, is there any way to get metadata
>> information about the document using the page ID directly? It probably
>> wouldn't be the query feature that would do this, btw.
>>
>> It is possible to get the metadata of a document using the page ID (instead
>> of the title) directly:
>> Title ->
>> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
>> Page ID ->
>> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
>>
>>
>> Tobias
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddy...@gmail.com]
>> Sent: Monday, September 19, 2011 11:35
>> To: connectors-user@incubator.apache.org
>> Subject: Re: Indexing Wikipedia/MediaWiki
>>
>> The API seems to be built around using Titles as document keys, and yet
>> there is a page ID also, which would probably be better at looking up page
>> data. So I have some new questions:
>>
>> (1) How do you form a URL that would take a user to a document? Does it use
>> the title, or does it use the page ID?
>> (2) If the URL includes the page ID, is there any way to get metadata
>> information about the document using the page ID directly? It probably
>> wouldn't be the query feature that would do this, btw.
>>
>> Thanks,
>> Karl
>>
>>
>> On Mon, Sep 19, 2011 at 5:09 AM, Wunderlich, Tobias
>> <tobias.wunderl...@igd-r.fraunhofer.de> wrote:
>>> Hey Karl,
>>>
>>> I did some research and the MediaWiki API looks promising:
>>>
>>> - There needs to be some notion of an overall list of pages:
>>>   - http://www.mediawiki.org/wiki/API:Allpages
>>>   - Example:
>>>     http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
>>>
>>> - Metadata information (author and pub date) also needs to be separated out
>>>   in some way:
>>>   - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>>>   - Example:
>>>     http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
>>>
>>> What do you think?
>>>
>>> Tobias
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:daddy...@gmail.com]
>>> Sent: Friday, September 16, 2011 16:11
>>> To: Sumana Harihareswara
>>> Cc: Wunderlich, Tobias
>>> Subject: Re: MediaWiki & Lucene development
>>>
>>> The lucene-search extension may or may not be appropriate for Tobias.
>>> But my interest would extend towards getting wiki content into whatever
>>> target a ManifoldCF sets up, not just Solr/Lucene. In order to do this the
>>> following needs to be addressed:
>>>
>>> - There needs to be some notion of an overall list of pages,
>>>   preferably queryable by date and time of last change;
>>> - We'd need access, per page, to authorization information
>>> - Metadata information (author and pub date) also needs to be
>>>   separated out in some way
>>>
>>> The plugin that Tobias mentioned seems to do the last item fine, but not
>>> the first two. Do you have a solution for those?
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Fri, Sep 16, 2011 at 9:40 AM, Sumana Harihareswara
>>> <suma...@wikimedia.org> wrote:
>>>> Hi. I happened to see you both discussing MediaWiki and
>>>> search/indexing in a mailing list recently.
>>>>
>>>> You might be interested in asking your question to the
>>>> MediaWiki/Wikimedia developers' list
>>>>
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>>
>>>> and I'd also welcome any assistance in improving our Lucene search
>>>> extension, which is used on Wikipedia:
>>>>
>>>> http://www.mediawiki.org/wiki/Extension:Lucene-search
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Sumana Harihareswara
>>>> Volunteer Development Coordinator
>>>> Wikimedia Foundation
>>>>
>>>
>>
>
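
For reference, the three kinds of API request mentioned in this thread (listing pages, fetching revision metadata by page ID, and resolving a page ID to its URL) could be built roughly as follows. This is only a sketch and not connector code: the function names are hypothetical, the endpoint and parameters are copied from the example URLs above, and nothing is actually fetched.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URLs in the thread.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_query_url(params):
    """Build an action=query URL; `params` holds the remaining parameters."""
    query = {"action": "query"}
    query.update(params)
    # urlencode percent-encodes reserved characters, so "|" becomes "%7C".
    return API_ENDPOINT + "?" + urlencode(query)


def allpages_url(start_title, limit):
    """URL listing up to `limit` pages starting at `start_title`."""
    return build_query_url(
        {"list": "allpages", "apfrom": start_title, "aplimit": limit}
    )


def revisions_url(page_id):
    """URL fetching revision metadata and content for one page ID."""
    return build_query_url(
        {
            "prop": "revisions",
            "pageids": page_id,
            "rvprop": "timestamp|user|comment|content",
        }
    )


def page_info_url(page_id):
    """URL asking the API for page info, including the page's own URL."""
    return build_query_url({"prop": "info", "pageids": page_id, "inprop": "url"})


if __name__ == "__main__":
    print(allpages_url("Kre", 5))
    print(revisions_url(27697087))
    print(page_info_url(27697087))
```

Keying everything on the page ID rather than the title matches the concern raised above about pages being renamed.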