It's easy to register with Jira and I encourage you to do so.

Nevertheless, I created the ticket: CONNECTORS-256.

Karl


On Mon, Sep 19, 2011 at 8:21 AM, Wunderlich, Tobias
<tobias.wunderl...@igd-r.fraunhofer.de> wrote:
> I don't know whether it is possible to link to a wiki document with just the
> page ID, but it is possible to get the URL for a given page ID via the API:
> http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
>
> A new connector for crawling wikis sounds great. Could you create a new
> ticket? I'm not registered with Jira yet.
>
> Tobias
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Monday, September 19, 2011 12:39
> To: connectors-user@incubator.apache.org
> Subject: Re: Indexing Wikipedia/MediaWiki
>
> The only thing that concerns me about using a document's title as its 
> document identifier in ManifoldCF is the possibility of it being renamed.  
> For that reason the Page ID is preferable.  But it doesn't sound like bad 
> things would happen either way.
>
> I'd like to suggest creating a JIRA ticket to describe a new connector for
> crawling wikis.  Then I may create a branch in which to work on this.  We're
> coming into conference season, so it may be some weeks before there's a
> connector to try, though.
>
> Karl
>
> On Mon, Sep 19, 2011 at 6:07 AM, Wunderlich, Tobias 
> <tobias.wunderl...@igd-r.fraunhofer.de> wrote:
>>  (1) How do you form a URL that would take a user to a document?  Does it 
>> use the title, or does it use the page ID?
>> I guess one way would be to just append the title to the base URL, like
>> http://en.wikipedia.org/wiki/<title>. However, I have not yet found out how
>> to create a URL to the document via the page ID.
>>
>>
>>  (2) If the URL includes the page ID, is there any way to get metadata 
>> information about the document using the page ID directly?  It probably 
>> wouldn't be the query feature that would do this, btw.
>>
>> It is possible to get the metadata of a document using the page ID (instead
>> of the title) directly:
>> Title ->
>> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
>> PageID ->
>> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
>>
>>
>> Tobias
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddy...@gmail.com]
>> Sent: Monday, September 19, 2011 11:35
>> To: connectors-user@incubator.apache.org
>> Subject: Re: Indexing Wikipedia/MediaWiki
>>
>> The API seems to be built around using Titles as document keys, and yet 
>> there is a page ID also, which would probably be better at looking up page 
>> data.  So I have some new questions:
>>
>> (1) How do you form a URL that would take a user to a document?  Does it use 
>> the title, or does it use the page ID?
>> (2) If the URL includes the page ID, is there any way to get metadata 
>> information about the document using the page ID directly?  It probably 
>> wouldn't be the query feature that would do this, btw.
>>
>> Thanks,
>> Karl
>>
>>
>> On Mon, Sep 19, 2011 at 5:09 AM, Wunderlich, Tobias 
>> <tobias.wunderl...@igd-r.fraunhofer.de> wrote:
>>> Hey Karl,
>>>
>>> I did some research and the MediaWiki API looks promising:
>>>
>>> - There needs to be some notion of an overall list of pages:
>>>        - http://www.mediawiki.org/wiki/API:Allpages
>>>        - Example:
>>> http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
>>>
>>> - Metadata information (author and pub date) also needs to be separated out 
>>> in some way:
>>>        - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>>>        - Example:
>>> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
>>>
>>> What do you think?
>>>
>>> Tobias
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:daddy...@gmail.com]
>>> Sent: Friday, September 16, 2011 16:11
>>> To: Sumana Harihareswara
>>> Cc: Wunderlich, Tobias
>>> Subject: Re: MediaWiki & Lucene development
>>>
>>> The lucene-search extension may or may not be appropriate for Tobias.
>>> But my interest would extend towards getting wiki content into whatever
>>> target ManifoldCF sets up, not just Solr/Lucene.  To do this, the
>>> following need to be addressed:
>>>
>>> - There needs to be some notion of an overall list of pages,
>>> preferably queryable by date and time of last change;
>>> - We'd need access, per page, to authorization information
>>> - Metadata information (author and pub date) also needs to be
>>> separated out in some way
>>>
>>> The plugin that Tobias mentioned seems to do the last item fine, but not 
>>> the first two.  Do you have a solution for those?
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Fri, Sep 16, 2011 at 9:40 AM, Sumana Harihareswara 
>>> <suma...@wikimedia.org> wrote:
>>>> Hi.  I happened to see you both discussing MediaWiki and
>>>> search/indexing in a mailing list recently.
>>>>
>>>> You might be interested in asking your question to the
>>>> MediaWiki/Wikimedia developers' list
>>>>
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>>
>>>> and I'd also welcome any assistance in improving our Lucene search
>>>> extension, which is used on Wikipedia:
>>>>
>>>> http://www.mediawiki.org/wiki/Extension:Lucene-search
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Sumana Harihareswara
>>>> Volunteer Development Coordinator
>>>> Wikimedia Foundation
>>>>
>>>
>>
>
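
For reference, the API requests quoted in this thread can also be assembled
programmatically. The following is only a minimal Python sketch, not code from
ManifoldCF or from any eventual connector. The helper names (build_api_url,
allpages_url, revisions_url, page_info_url) are hypothetical; only the endpoint
and the query parameters are taken from the messages above. It builds URLs and
makes no network requests.

```python
from urllib.parse import urlencode

# Endpoint used in the examples quoted in this thread.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_api_url(params, endpoint=API_ENDPOINT):
    """Return an API URL for the given query parameters (hypothetical helper)."""
    # safe="|" leaves the pipe separator unescaped, as in the thread's examples.
    return endpoint + "?" + urlencode(params, safe="|")


def allpages_url(start_from, limit):
    """URL that lists pages starting at a given title (list=allpages)."""
    return build_api_url({
        "action": "query",
        "list": "allpages",
        "apfrom": start_from,
        "aplimit": limit,
    })


def revisions_url(page_id):
    """URL that fetches revision metadata and content by page ID."""
    return build_api_url({
        "action": "query",
        "prop": "revisions",
        "pageids": page_id,
        "rvprop": "timestamp|user|comment|content",
    })


def page_info_url(page_id):
    """URL that asks the API for the page's URL, given its page ID."""
    return build_api_url({
        "action": "query",
        "prop": "info",
        "pageids": page_id,
        "inprop": "url",
    })


if __name__ == "__main__":
    print(allpages_url("Kre", 5))
    print(revisions_url(27697087))
    print(page_info_url(27697087))
```

Keying the helpers on the page ID rather than the title follows the concern
raised above that a title can change when a page is renamed.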
