On Tue, May 1, 2012 at 10:25 PM, Brad Jorsch
<[email protected]> wrote:
> On Mon, Apr 30, 2012 at 11:08:40PM +0200, Petr Onderka wrote:
>>
>> In other words, if I find some results for some page on API page n and
>> no results on API page n+1, can I be sure there will be no results on
>> pages > n?
>
> Not necessarily. In most cases that assumption should be true, but I see
> a few cases offhand where it wouldn't be:
>
> * If you're using prop=revisions&revids=...&rvprop=content with
>  revisions big enough that the API response size limit comes into play,
>  you could wind up in a situation where the initial query returns
>  revision 1 from page A, the second returns revision 2 from page B, and
>  the third returns revision 3 from page A again.

Interesting, I didn't know there was a limit for the response size.

> * Some modules, such as prop=extlinks, cannot use anything sane for the
>  continue parameter (or else MySQL blows up), so they just use "offset
>  into the arbitrarily-ordered set of results". It's possible that edits
>  made to the wiki between your calls could change the result set so
>  that values are repeated, skipped, or both.

That's exactly what I wanted to know, thanks. This means I won't be
relying on the order of results.
Too bad this module behaves that way.
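
Keying the results by page ID instead of by position could look roughly
like the sketch below (Python rather than C#, to keep it short). The name
`merge_batches` is made up for illustration, and each batch is assumed to
be the `query.pages` object of one decoded JSON response. This only
tolerates reordering and repeats; a value that gets skipped because of an
edit between calls is still lost.

```python
def merge_batches(batches):
    """Merge prop=extlinks results from several responses, keyed by page ID.

    Assumption: each batch is a dict mapping page ID strings to page
    objects, i.e. the "pages" object of one decoded API response.
    """
    merged = {}
    for pages in batches:
        for page_id, page in pages.items():
            # A page may show up in any batch, in any order, more than once.
            links = merged.setdefault(page_id, [])
            for link in page.get("extlinks", []):
                if link not in links:  # drop values repeated across batches
                    links.append(link)
    return merged
```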

> * If you are using multiple modules, it might be the case that one
>  goes through the pages in order by page_id while the other goes by
>  title, or something along those lines. In practice it seems that all
>  modules that commonly continue will order by the page_id, so the only
>  way you might run into this is if the API response size limit causes
>  modules like categoryinfo or imageinfo that usually don't continue to
>  do so.

That wouldn't matter to me: I consider each module separately,
because each module has its own lazy collection,
even if they are paged together.

> I haven't checked any of the prop modules provided by extensions, BTW.
> Chances are most of those are well-behaved and order by page_id, but
> it's possible some of them may do things differently.
>
>> I am writing a library to access the API and every collection in my
>> library is lazy.
>>
>> For example, a user requests to know categories of pages in
>> Category:Query languages.
>>
>> When he starts iterating over the result, I execute the query:
>> http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Query%20languages&prop=categories
>>
>> When he then requests to know the categories of the third page in the
>> result (Access query language),
>> I will return to him the categories from the first query. If he
>> requests more, I execute the query:
>> http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Query%20languages&prop=categories&clcontinue=494528|All%20pages%20needing%20cleanup
>
> How do you determine that you should look at "Access query language"
> first rather than one of the other pages?

I meant that the user could decide he wants to know the categories of
that page, and not of the ones before it.
Something like (C# code, that's what I'm writing the library in):

pages.Where(p => p.title == "Access query language")
    .Select(p => new { title = p.title, categories = p.categories.ToArray()})
    .ToArray()

where `pages` represents the result of the API call.

This specific code wouldn't make much sense, but I can imagine wanting
to filter the results by something the API won't let you filter on,
for example to get the categories of pages that are in both
Category:Foo and Category:Bar.
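
A client-side filter of that kind might be sketched as follows (again in
Python for brevity). The function name `pages_in_both` is hypothetical,
and the inputs are assumed to be iterables of page objects that each
carry a `pageid`, as in the items returned by list=categorymembers.

```python
def pages_in_both(members_a, members_b):
    """Lazily yield the pages from members_a that also occur in members_b.

    Assumption: every page object is a dict with a "pageid" key.
    members_b is consumed completely up front; members_a stays lazy.
    """
    ids_b = {page["pageid"] for page in members_b}
    for page in members_a:
        if page["pageid"] in ids_b:
            yield page
```

Only the second collection has to be fetched in full; the first one can
remain a lazy sequence that is paged on demand.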

> In my bot code, I have something that behaves similarly: you give it a
> query, and it gives back a series of result pages. But my version will
> process clcontinue all the way to the end right away; the laziness is
> only in handling gcmcontinue. That way I can be sure that the page nodes
> returned by successive calls will have all the necessary data without
> worrying about the ordering of the prop module results.
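
If I understand that approach correctly, it could be sketched roughly
like this (in Python for brevity; this is my reading of the description,
not the actual bot code). The name `iter_complete_pages` and the injected
`fetch` callable, which takes a dict of query parameters and returns the
decoded JSON response, are assumptions. The sketch also assumes the
`query-continue` response element and that the gcmcontinue value from the
first response of a batch stays valid while clcontinue is being followed.

```python
def iter_complete_pages(fetch, params):
    """Yield pages lazily per generator batch, with all categories attached.

    clcontinue is followed to the end immediately for each batch;
    gcmcontinue is only followed when the caller asks for more pages.
    """
    gen_continue = {}
    while True:
        base = {**params, **gen_continue}
        response = fetch(base)
        pages = response.get("query", {}).get("pages", {})
        cont = response.get("query-continue", {})
        # Remember where the generator continues (None if it is finished).
        next_gen = cont.get("categorymembers")
        # Eagerly exhaust the prop module for this batch of pages.
        while "categories" in cont:
            more = fetch({**base, **cont["categories"]})
            more_pages = more.get("query", {}).get("pages", {})
            for page_id, page in more_pages.items():
                target = pages.setdefault(page_id, page)
                if target is not page:
                    extra = page.get("categories", [])
                    target.setdefault("categories", []).extend(extra)
            cont = more.get("query-continue", {})
        yield from pages.values()
        if next_gen is None:
            break
        gen_continue = next_gen
```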

Thanks for your response; this really helped me.

Petr Onderka
[[en:User:Svick]]

_______________________________________________
Mediawiki-api mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
