Re: [Mediawiki-api] Migrating to "dumb query-continue"

Yuri Astrakhan Tue, 18 Dec 2012 07:04:04 -0800

Assume a wiki has pages A and B with the following links and
categories: A(l1,l2,l3,l4,l5,c1,c2,c3), B(l1,c1).  This is how the API
behaves now:


1 req)  prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2
& cllimit=2
1 res)  A(l1,l2,c1,c2), gapcontinue=B, plcontinue=l3, clcontinue=c3

The client ignores gapcontinue because other continues are present, and
adds the pl & cl continues:
2 req)  initial & plcontinue=l3 & clcontinue=c3
2 res)  A(l3,l4,c3), gapcontinue=B, plcontinue=l5

This is where the *potential* for a bug is: the client must understand that
since there is no more clcontinue, but there is still a plcontinue, there
are no more categories in this set of pages, so it should not ask for
prop=categories until it finishes with plcontinue. Once done, it should
resume prop=categories and also add gapcontinue=B.

3 bad req)  initial & plcontinue=l5
3 bad res)  A(l5,c1,c2), gapcontinue=B, clcontinue=c3

3 good req)  initial but with prop=links only & plcontinue=l5
3 good res)  A(l5), gapcontinue=B

4 req) initial & gapcontinue=B
4 res) B(l1,c1)  -- done
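
To make the bookkeeping above concrete, here is a minimal Python sketch of
the decision a client has to make after every response under the current
behavior. The function name next_request, the PROP_CONTINUE table, and the
representation of prop as a list are assumptions made for illustration;
they are not part of the API.

```python
# Which continue key belongs to which prop module (only the two used above).
PROP_CONTINUE = {"categories": "clcontinue", "links": "plcontinue"}
GENERATOR_CONTINUE = "gapcontinue"


def next_request(initial, previous, cont):
    """Return the parameters of the next request, or None when finished.

    initial  -- parameters of the first request, with "prop" as a list
    previous -- parameters of the request that was just answered
    cont     -- continue values found in that answer
    """
    # Props that were requested last time and still report a continue value.
    pending = [p for p in previous["prop"] if PROP_CONTINUE[p] in cont]
    if pending:
        # Stay on the same batch of pages and ask only for unfinished props,
        # otherwise finished props would be returned again (the "bad" step 3).
        request = {**initial, "prop": pending}
        if GENERATOR_CONTINUE in previous:
            request[GENERATOR_CONTINUE] = previous[GENERATOR_CONTINUE]
        for p in pending:
            request[PROP_CONTINUE[p]] = cont[PROP_CONTINUE[p]]
        return request
    if GENERATOR_CONTINUE in cont:
        # Every prop is exhausted: resume all props on the next batch.
        return {**initial, GENERATOR_CONTINUE: cont[GENERATOR_CONTINUE]}
    return None


# Replaying the walk-through above with the continue values of each response.
initial = {"generator": "allpages", "gaplimit": 1, "pllimit": 2, "cllimit": 2,
           "prop": ["categories", "links"]}
req2 = next_request(initial, initial,
                    {"gapcontinue": "B", "plcontinue": "l3", "clcontinue": "c3"})
req3 = next_request(initial, req2, {"gapcontinue": "B", "plcontinue": "l5"})
req4 = next_request(initial, req3, {"gapcontinue": "B"})
```

Even in this reduced form the client needs a per-module table and has to
remember what it asked for in the previous request.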

I think this puts too much unneeded burden on the client code to handle
these cases correctly. Instead, the API should be simplified to return
clcontinue=| in result #2, and results 1 and 2 should have gapcontinue=A.
The client could then simply merge all returned continue values into the
following request. That greatly simplifies the code for the most common
"get everything I requested" scenario, so it should be the default behavior:

1 req)  prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2
& cllimit=2
1 res)  A(l1,l2,c1,c2), gapcontinue=, plcontinue=l3, clcontinue=c3

2 req)  initial & gapcontinue= & plcontinue=l3 & clcontinue=c3
2 res)  A(l3,l4,c3), gapcontinue=, plcontinue=l5, clcontinue=|

3 req)  initial & gapcontinue= & plcontinue=l5 & clcontinue=|
3 res)  A(l5), gapcontinue=B, plcontinue=, clcontinue=

4 req) initial & gapcontinue=B & plcontinue= & clcontinue=
4 res) B(l1,c1)  -- no continue section, done
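
With that behavior the whole client loop could be as small as the following
Python sketch. The send callable, the "continue" and "pages" keys of the
response, and the canned replies are assumptions that stand in for the real
HTTP call and response format; only the merge step is the point.

```python
def query_all(send, initial):
    """Yield every response, merging all continue values into the next request."""
    params = dict(initial)
    while True:
        response = send(params)
        yield response
        cont = response.get("continue")
        if not cont:  # no continue section: done
            return
        params = {**initial, **cont}


def make_canned_send(replies):
    """Stand-in for the HTTP call: replays fixed replies and records requests."""
    sent = []
    remaining = iter(replies)

    def send(params):
        sent.append(dict(params))
        return next(remaining)

    return send, sent


# The four responses from the walk-through above.
REPLIES = [
    {"pages": "A(l1,l2,c1,c2)",
     "continue": {"gapcontinue": "", "plcontinue": "l3", "clcontinue": "c3"}},
    {"pages": "A(l3,l4,c3)",
     "continue": {"gapcontinue": "", "plcontinue": "l5", "clcontinue": "|"}},
    {"pages": "A(l5)",
     "continue": {"gapcontinue": "B", "plcontinue": "", "clcontinue": ""}},
    {"pages": "B(l1,c1)"},
]
initial = {"prop": "categories|links", "generator": "allpages",
           "gaplimit": 1, "pllimit": 2, "cllimit": 2}
send, sent = make_canned_send(REPLIES)
pages = [response["pages"] for response in query_all(send, initial)]
```

Because query_all is a generator, the caller can process each response as it
arrives instead of holding the complete result set in memory.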


> That would be quite a change. It would mean the API wouldn't return
> gapcontinue at all until plcontinue and clcontinue are both exhausted,
> and then would keep returning the *old* gapcontinue until plcontinue
> and clcontinue are both exhausted again.
>

Correct, the API would return an empty gapcontinue until it finishes with
the first set, then it would return the beginning of the next set until that
is exhausted as well, etc.


> This would break some possible use cases which I'm not entirely sure
> we should break. For example, I can imagine a bot that would use
> generator=foo&gfoolimit=1&prop=revisions, follow rvcontinue until it
> finds whichever revision it is looking for, and then ignore rvcontinue
> in favor of gfoocontinue to move on to the next page. With "dumb
> continue", it wouldn't be able to do that.
>


I do not think the API should support the case you described with
gaplimit=1, because that fundamentally breaks the original API goal of "get
data about many pages with lots of elements on them in one request". I would
prefer that the client do two separate kinds of queries: 1) list pages
2) many queries "list revisions for page X". Using a generator with
gaplimit=1 does not improve server performance or minimize traffic.
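
A rough Python sketch of that two-step approach follows. list_pages and
list_revisions are hypothetical callables that would wrap the "list pages"
and "list revisions for page X" queries; here they are replaced by in-memory
stand-ins with made-up data so the control flow can be seen on its own.

```python
def first_matching_revision(list_pages, list_revisions, wanted):
    """Map each page title to its first revision accepted by `wanted`."""
    found = {}
    for title in list_pages():
        # list_revisions is expected to be lazy, so breaking out early
        # stops any further continuation requests for that page.
        for revision in list_revisions(title):
            if wanted(revision):
                found[title] = revision
                break
    return found


# In-memory stand-ins for the two kinds of queries (made-up data).
REVISIONS = {"A": ["r3 minor", "r2 revert", "r1 create"], "B": ["r1 create"]}
requested = []


def list_pages():
    yield from REVISIONS


def list_revisions(title):
    for revision in REVISIONS[title]:
        requested.append((title, revision))
        yield revision


result = first_matching_revision(list_pages, list_revisions,
                                 lambda revision: "revert" in revision)
```

The bot stops reading revisions of a page as soon as it finds a match and
moves on to the next page, without the generator being involved at all.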

But even if we do find compelling reasons to include that, for the advanced
scenario "skip subquery and follow on with the generator" it might make
sense to introduce an appendable "|next" keyword (gapcontinue=A|next) or a
gcommand=skipcurrent parameter. I am not sure it is the cleanest solution,
but it is certainly cleaner than forcing every client out there to have the
complex logic from above for all common cases.

1 req)  prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2
& cllimit=2
1 res)  A(l1,l2,c1,c2), gapcontinue=, plcontinue=l3, clcontinue=c3

The client decides it does not need anything else from A, so it adds |next
to gapcontinue. The API ignores all other property continues.
2 req)  initial & gapcontinue=|next & plcontinue=l3 & clcontinue=c3
2 res)  B(l1,c1) -- done
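
On the client side this would amount to a one-line change to the continue
values, sketched below in Python. The "|next" keyword is only the proposal
from this message, not an existing API feature, and skip_current_pages is a
hypothetical helper name.

```python
def skip_current_pages(cont):
    """Return continue values that ask the generator to move to its next batch.

    Appends the proposed "|next" keyword to gapcontinue; the other continue
    values are passed through unchanged because the API would ignore them.
    """
    skipped = dict(cont)
    skipped["gapcontinue"] = cont.get("gapcontinue", "") + "|next"
    return skipped


# Continue values from response 1 above, turned into request 2.
cont = {"gapcontinue": "", "plcontinue": "l3", "clcontinue": "c3"}
request2 = {"prop": "categories|links", "generator": "allpages",
            **skip_current_pages(cont)}
```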

> The client would still have to know how to manipulate
> list=/meta=/generator=/prop=, particularly when using more than one of
> these in the same query. But the rules are simpler, it wouldn't have
> to know that gclcontinue is for generator=categories while clcontinue
> is for prop=categories, and it would be easy to know what exactly to
> include in prop= when continuing to avoid repeated results.
>

Complex client logic is exactly what I am trying to avoid. Ideally all
"continue" values should be joined into a single "query-continue =
magic-value" that is opaque to the client and has no parts the user is
expected to read or modify.
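
One way such a magic value could be produced is sketched below in Python,
purely as an illustration of the idea. The JSON-plus-base64 encoding and the
names pack_continue and unpack_continue are assumptions; nothing here
describes how the API does or would encode its state.

```python
import base64
import json


def pack_continue(values):
    """Fold all per-module continue values into one opaque token."""
    raw = json.dumps(values, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def unpack_continue(token):
    """Recover the per-module continue values on the server side."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


state = {"gapcontinue": "", "plcontinue": "l5", "clcontinue": "|"}
token = pack_continue(state)
```

The client would only ever echo the token back, so the server stays free to
change what it contains.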


> You can't get away with changing the generator's continue like that
> and still get correct results, because you can't assume the generator
> generates pages in the same order every prop module processes them.
> Nor can you assume each prop module will process pages in the same
> order. For example, many prop modules order by page_id but may be ASC
> or DESC on their "dir" parameter.
>

Totally agree - I forgot about the sub-ordering. So we keep the same
gapcontinue until the set is exhausted. The key here is that if we do not
let the client manipulate the continue parameters, the server could later
be optimized to return fewer results if they cannot yet be populated.



> IMO, if a client wants to ensure it has complete results for any page
> objects in the result, it should just process all of the prop
> continuation parameters to completion.
>

The result set might be huge. It wouldn't be nice to have a 12GB, x64-only
client lib requirement :)

