On Tue, May 1, 2012 at 10:25 PM, Brad Jorsch <[email protected]> wrote:
> On Mon, Apr 30, 2012 at 11:08:40PM +0200, Petr Onderka wrote:
>>
>> In other words, if I find some results for some page on API page n and
>> no results on API page n+1, can I be sure there will be no results on
>> pages > n?
>
> Not necessarily. In most cases that assumption should be true, but I see
> a few cases offhand where it wouldn't be:
>
> * If you're using prop=revisions&revids=...&rvprop=content with
>   revisions big enough that the API response size limit comes into play,
>   you could wind up in a situation where the initial query returns
>   revision 1 from page A, the second returns revision 2 from page B, and
>   the third returns revision 3 from page A again.

Interesting, I didn't know there was a limit for the response size.

> * Some modules, such as prop=extlinks, cannot use anything sane for the
>   continue parameter (or else MySQL blows up), so they just use "offset
>   into the arbitrarily-ordered set of results". It's possible that edits
>   made to the wiki between your calls could change the result set so
>   that values are repeated, skipped, or both.

That's exactly what I wanted to know, thanks. This means I won't be
relying on the order of results. Too bad this module behaves that way.

> * If you are using multiple modules, it might be the case that one
>   goes through the pages in order by page_id while the other goes by
>   title, or something along those lines. In practice it seems that all
>   modules that commonly continue will order by the page_id, so the only
>   way you might run into this is if the API response size limit causes
>   modules like categoryinfo or imageinfo that usually don't continue to
>   do so.

That wouldn't matter to me: I consider each module separately, because
each module has its own lazy collection, even if they are paged together.

> I haven't checked any of the prop modules provided by extensions, BTW.
> Chances are most of those are well-behaved and order by page_id, but
> it's possible some of them may do things differently.
>
>> I am writing a library to access the API and every collection in my
>> library is lazy.
>>
>> For example, a user requests to know categories of pages in
>> Category:Query languages.
>>
>> When he starts iterating over the result, I execute the query:
>> http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Query%20languages&prop=categories
>>
>> When he then requests to know the categories of the third page in the
>> result (Access query language),
>> I will return to him the categories from the first query. If he
>> requests more, I execute the query:
>> http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Query%20languages&prop=categories&clcontinue=494528|All%20pages%20needing%20cleanup
>
> How do you determine that you should look at "Access query language"
> first rather than one of the other pages?

I meant that the user could decide he wants to know the categories of
that page and not the ones before it. Something like (C# code, that's
what I'm writing the library in):

    pages.Where(p => p.title == "Access query language")
         .Select(p => new { title = p.title, categories = p.categories.ToArray() })
         .ToArray()

where `pages` represents the result of the API call.

This specific code wouldn't make much sense, but I can imagine wanting
to filter the results by something the API won't let you filter on. For
example, if you wanted to know the categories of pages that are in both
Category:Foo and Category:Bar.

> In my bot code, I have something that behaves similarly: you give it a
> query, and it gives back a series of result pages. But my version will
> process clcontinue all the way to the end right away; the laziness is
> only in handling gcmcontinue. That way I can be sure that the page nodes
> returned by successive calls will have all the necessary data without
> worrying about the ordering of the prop module results.

Thanks for your response, this really helped me.

Petr Onderka
[[en:User:Svick]]

_______________________________________________
Mediawiki-api mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
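
For reference, here is a minimal sketch of the approach Brad describes in the last quoted paragraph: for each batch of generator results, follow clcontinue all the way to the end and merge the categories per page, and stay lazy only on gcmcontinue. It is written in Python just to keep it short and is not taken from anyone's actual bot or library code. Several things are assumptions: `fetch` is a hypothetical callable that takes the query parameters and returns the decoded JSON response; the response is assumed to carry continuation data under `query-continue`, keyed by module name; and the names `iter_complete_pages`, `GENERATOR` and `PROP` are made up for illustration.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Params = Dict[str, str]
Fetch = Callable[[Params], Dict[str, Any]]

# Assumed module names under "query-continue" for this particular query.
GENERATOR = "categorymembers"  # carries gcmcontinue
PROP = "categories"            # carries clcontinue


def iter_complete_pages(fetch: Fetch, base: Params) -> Iterator[Dict[str, Any]]:
    """Yield page nodes only once all of their categories have been fetched.

    `fetch` is a hypothetical function supplied by the caller; it performs
    the API request and returns the decoded JSON as a dict.
    """
    gen_continue: Params = {}
    while True:
        pages: Dict[str, Dict[str, Any]] = {}
        next_gen: Optional[Params] = None
        prop_continue: Params = {}

        # Eager part: exhaust clcontinue for the current generator batch.
        while True:
            data = fetch({**base, **gen_continue, **prop_continue})
            for page_id, node in data.get("query", {}).get("pages", {}).items():
                # Merge by page id, so the order in which the prop module
                # returns its results does not matter.
                merged = pages.setdefault(
                    page_id, {"title": node.get("title"), "categories": []}
                )
                merged["categories"].extend(
                    c["title"] for c in node.get("categories", [])
                )
            cont = data.get("query-continue", {})
            next_gen = cont.get(GENERATOR, next_gen)
            if PROP not in cont:
                break
            prop_continue = cont[PROP]

        # Every page in this batch is now complete.
        yield from pages.values()

        # Lazy part: the next generator batch is requested only if the
        # caller keeps iterating past this point.
        if next_gen is None:
            return
        gen_continue = next_gen
```

Because the result is a generator, a caller that stops after the first few pages never triggers the request for the next gcmcontinue batch. The cost is that every clcontinue request for the current batch is made up front, even if the caller only wanted one page from it.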
