I don't know whether it is possible to link to a wiki document using just the page ID, 
but it is possible to get the URL for a given page ID via the API:
http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
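
A minimal sketch of how a crawler might use this call, offered as an illustration and not as a finished connector. The helper names (info_url_for_pageid, extract_fullurl) are hypothetical. The format=json parameter is my addition to the URL above, and the response layout (query -> pages -> <pageid> -> fullurl) is an assumption about the JSON output that should be checked against the wiki being crawled.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above; other wikis will differ.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def info_url_for_pageid(pageid, endpoint=API_ENDPOINT):
    """Build the prop=info query URL for a single page ID."""
    params = {
        "action": "query",
        "prop": "info",
        "pageids": str(pageid),
        "inprop": "url",
        "format": "json",  # assumption: ask for JSON so the reply is easy to parse
    }
    return endpoint + "?" + urlencode(params)


def extract_fullurl(response, pageid):
    """Pull the page URL out of an already-decoded JSON response.

    Assumes the layout query -> pages -> "<pageid>" -> fullurl.
    """
    return response["query"]["pages"][str(pageid)]["fullurl"]


if __name__ == "__main__":
    print(info_url_for_pageid(27697087))
```

The sketch only builds the request URL and reads a decoded response, so the HTTP fetch itself is left to whatever client the connector ends up using.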

A new connector for crawling Wikis sounds great. Could you create a new ticket? 
I'm not registered with JIRA yet ...

Tobias
 

-----Original Message-----
From: Karl Wright [mailto:daddy...@gmail.com] 
Sent: Monday, September 19, 2011 12:39
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Wikipedia/MediaWiki

The only thing that concerns me about using a document's title as its document 
identifier in ManifoldCF is the possibility of it being renamed.  For that 
reason the Page ID is preferable.  But it doesn't sound like bad things would 
happen either way.

I'd like to suggest creating a JIRA ticket to describe a new connector for 
crawling wikis.  Then I may create a branch in which to work on this.  We're 
coming into conference season so it may be some weeks before there's a 
connector to try, though.

Karl

On Mon, Sep 19, 2011 at 6:07 AM, Wunderlich, Tobias 
<tobias.wunderl...@igd-r.fraunhofer.de> wrote:
>  (1) How do you form a URL that would take a user to a document?  Does it use 
> the title, or does it use the page ID?
> I guess one way would be to just append the title to the base URL, like 
> http://en.wikipedia.org/wiki/<title>. However, I have not yet found out how to 
> create a URL to the document via the page ID.
>
>
>  (2) If the URL includes the page ID, is there any way to get metadata 
> information about the document using the page ID directly?  It probably 
> wouldn't be the query feature that would do this, btw.
>
> It is possible to get the metadata of a document using the page ID (instead 
> of the title) directly:
> Title -> 
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
> PageID -> 
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
>
>
> Tobias
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Monday, September 19, 2011 11:35
> To: connectors-user@incubator.apache.org
> Subject: Re: Indexing Wikipedia/MediaWiki
>
> The API seems to be built around using Titles as document keys, and yet there 
> is a page ID also, which would probably be better at looking up page data.  
> So I have some new questions:
>
> (1) How do you form a URL that would take a user to a document?  Does it use 
> the title, or does it use the page ID?
> (2) If the URL includes the page ID, is there any way to get metadata 
> information about the document using the page ID directly?  It probably 
> wouldn't be the query feature that would do this, btw.
>
> Thanks,
> Karl
>
>
> On Mon, Sep 19, 2011 at 5:09 AM, Wunderlich, Tobias 
> <tobias.wunderl...@igd-r.fraunhofer.de> wrote:
>> Hey Karl,
>>
>> I did some research and the MediaWiki API looks promising:
>>
>> - There needs to be some notion of an overall list of pages:
>>        - http://www.mediawiki.org/wiki/API:Allpages
>>        - Example:
>> http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
>>
>> - Metadata information (author and pub date) also needs to be separated out 
>> in some way:
>>        - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>>        - Example:
>> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
>>
>> What do you think?
>>
>> Tobias
>>
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddy...@gmail.com]
>> Sent: Friday, September 16, 2011 16:11
>> To: Sumana Harihareswara
>> Cc: Wunderlich, Tobias
>> Subject: Re: MediaWiki & Lucene development
>>
>> The lucene-search extension may or may not be appropriate for Tobias.
>> But my interest would extend towards getting wiki content into whatever 
>> target ManifoldCF sets up, not just Solr/Lucene.  In order to do this the 
>> following needs to be addressed:
>>
>> - There needs to be some notion of an overall list of pages, 
>> preferably queryable by date and time of last change;
>> - We'd need access, per page, to authorization information
>> - Metadata information (author and pub date) also needs to be 
>> separated out in some way
>>
>> The plugin that Tobias mentioned seems to do the last item fine, but not the 
>> first two.  Do you have a solution for those?
>>
>> Thanks,
>> Karl
>>
>> On Fri, Sep 16, 2011 at 9:40 AM, Sumana Harihareswara 
>> <suma...@wikimedia.org> wrote:
>>> Hi.  I happened to see you both discussing MediaWiki and 
>>> search/indexing in a mailing list recently.
>>>
>>> You might be interested in asking your question to the 
>>> MediaWiki/Wikimedia developers' list
>>>
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>>> and I'd also welcome any assistance in improving our Lucene search 
>>> extension, which is used on Wikipedia:
>>>
>>> http://www.mediawiki.org/wiki/Extension:Lucene-search
>>>
>>> Thanks!
>>>
>>> --
>>> Sumana Harihareswara
>>> Volunteer Development Coordinator
>>> Wikimedia Foundation
>>>
>>
>
