> You can parallelize with opaque tokens by sending a starting point for
the next request.

I meant that we would have to wait for the server to return this starting
point from the previous request, right? With start/limit, each client can
query for its own chunk without coordination.
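To make the contrast concrete, here is a rough sketch of the two pagination styles being discussed. The endpoint shapes and helper functions here are purely illustrative, not anything from the actual REST spec: with start/limit every chunk boundary is computable up front, while an opaque token chains each request to the previous response.

```python
# Hypothetical sketch of the two pagination styles; names and endpoints are
# illustrative only, not the Iceberg REST spec.

def start_limit_chunks(total, limit):
    """With start/limit, clients compute every chunk boundary up front, so
    N requests can be issued in parallel with no coordination, e.g.
    GET /namespaces?start=0&limit=100, GET /namespaces?start=100&limit=100, ...
    """
    return [(start, limit) for start in range(0, total, limit)]

def fetch_all_with_tokens(fetch_page):
    """With an opaque continuation token, each request needs the token from
    the previous response, so requests are inherently sequential unless the
    server hands back precomputed split points."""
    items, token = [], None
    while True:
        page, token = fetch_page(token)
        items.extend(page)
        if token is None:
            return items

# Simulated token-based server over an in-memory list, using the next start
# index as the (normally opaque) token.
data = list(range(10))

def fake_fetch(token, page_size=4):
    start = 0 if token is None else token
    page = data[start:start + page_size]
    next_token = start + page_size if start + page_size < len(data) else None
    return page, next_token

print(start_limit_chunks(350, 100))       # four independent requests
print(fetch_all_with_tokens(fake_fetch))  # three chained requests
```

The token loop above also shows why old clients are a compatibility concern: a client that does not know about the token simply stops after the first page.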

On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:

> > I think start and offset has the advantage of being parallelizable (as
> compared to continuation tokens).
>
> You can parallelize with opaque tokens by sending a starting point for the
> next request.
>
> > On the other hand, using "asOf" can be complex to implement and may be
> too powerful for the pagination use case
>
> I don't think that we want to add `asOf`. If the service chooses to do
> this, it would send a continuation token that has the information embedded.
>
> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Can we assume it is the responsibility of the server to ensure
>> determinism (e.g., by caching the results along with query ID)? I think
>> start and offset has the advantage of being parallelizable (as compared to
>> continuation tokens). On the other hand, using "asOf" can be complex to
>> implement and may be too powerful for the pagination use case (because it
>> allows querying the warehouse as of any point in time, not just now).
>>
>> Thanks,
>> Walaa.
>>
>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I think you can solve the atomicity problem with a continuation token
>>> and server-side state. In general, I don't think this is a problem we
>>> should worry about a lot since pagination commonly has this problem. But
>>> since we can build a system that allows you to solve it if you choose to,
>>> we should go with that design.
>>>
>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jack,
>>>> Some answers inline.
>>>>
>>>>
>>>>> In addition to the start index approach, another potential simple way
>>>>> to implement the continuation token is to use the last item name, when the
>>>>> listing is guaranteed to be in lexicographic order.
>>>>
>>>>
>>>> I think this is one viable implementation, but the reason that the
>>>> token should be opaque is that it allows several different implementations
>>>> without client side changes.
>>>>
>>>> For example, if an element is added before the continuation token, then
>>>>> all future listing calls with the token would always skip that element.
>>>>
>>>>
>>>> IMO, I think this is fine. For some of the REST APIs it is likely
>>>> important to put constraints on atomicity requirements; for others (e.g.
>>>> list namespaces) I think it is OK to have looser requirements.
>>>>
>>>> If we want to enforce that level of atomicity, we probably want to
>>>>> introduce another time travel query parameter (e.g. asOf=1703003028000) to
>>>>> ensure that we are listing results at a specific point of time of the
>>>>> warehouse, so the complete result list is fixed.
>>>>
>>>>
>>>> Time travel might be useful in some cases but I think it is orthogonal
>>>> to services wishing to have guarantees around atomicity/consistency of
>>>> results.  If a server wants to ensure that results are atomic/consistent as
>>>> of the start of the listing, it can embed the necessary timestamp in the
>>>> token it returns and parse it out when fetching the next result.
>>>>
>>>> I think this does raise a more general point around service definition
>>>> evolution in general.  I think there likely need to be metadata endpoints
>>>> that expose either:
>>>> 1.  A version of the REST API supported.
>>>> 2.  Features the API supports (e.g. which query parameters are honored
>>>> for a specific endpoint).
>>>>
>>>> There are pros and cons to both approaches (apologies if I missed this
>>>> in the spec or if it has already been discussed).
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Yes I agree that it is better to not enforce the implementation to
>>>>> favor any direction, and continuation token is probably better than
>>>>> enforcing a numeric start index.
>>>>>
>>>>> In addition to the start index approach, another potential simple way
>>>>> to implement the continuation token is to use the last item name, when the
>>>>> listing is guaranteed to be in lexicographic order. Compared to the start
>>>>> index approach, it does not need to worry about the change of start index
>>>>> when something in the list is added or removed.
>>>>>
>>>>> However, the issue of concurrent modification could still exist even
>>>>> with a continuation token. For example, if an element is added before the
>>>>> continuation token, then all future listing calls with the token would
>>>>> always skip that element. If we want to enforce that level of atomicity,
>>>>> we probably want to introduce another time travel query parameter (e.g.
>>>>> asOf=1703003028000) to ensure that we are listing results at a specific
>>>>> point in time of the warehouse, so the complete result list is fixed.
>>>>> (This is also the missing piece I forgot to mention in the start index
>>>>> approach to ensure it works in distributed settings.)
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I tried to cover these in more details at:
>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>
>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>>> that HTTP clients and servers have HTTP/2 streaming support, which is
>>>>>>> not compatible with old clients.
>>>>>>>
>>>>>>> I share the same concern with Micah that only start/limit may not be
>>>>>>> enough in a distributed environment where modification happens during
>>>>>>> iterations. For compatibility, we need to consider several cases:
>>>>>>>
>>>>>>> 1. Old client <-> New Server
>>>>>>> 2. New client <-> Old server
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I agree that we want to include this feature and I raised similar
>>>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>>>
>>>>>>>> For backward compatibility, just adding a start and limit implies a
>>>>>>>> deterministic order, which is not a current requirement of the REST 
>>>>>>>> spec.
>>>>>>>>
>>>>>>>> Also, we need to consider whether the start/limit would need to be
>>>>>>>> respected by the server.  If existing implementations simply return
>>>>>>>> all the results, will that be sufficient?  There are a few edge cases
>>>>>>>> that need to be considered here.
>>>>>>>>
>>>>>>>> For the opaque key approach, I think adding a query param to
>>>>>>>> trigger/continue pagination and introducing a continuation token in
>>>>>>>> the ListNamespacesResponse might allow for more backward
>>>>>>>> compatibility.  In that scenario, pagination would only take place
>>>>>>>> for clients who know how to paginate, and the ordering would not
>>>>>>>> need to be deterministic.
>>>>>>>>
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <
>>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>>
>>>>>>>>> The behavior with no additional parameters requires the operations
>>>>>>>>> to happen as they do today for backwards compatibility (i.e. either
>>>>>>>>> all responses are returned or a failure occurs).
>>>>>>>>>
>>>>>>>>> For new parameters, I'd suggest an opaque start token (instead of a
>>>>>>>>> specific numeric offset) that can be returned by the service, and a
>>>>>>>>> limit (as proposed above). If a start token is provided without a
>>>>>>>>> limit, a default limit can be chosen by the server.  Servers might
>>>>>>>>> return less than the limit (i.e. clients are required to check for
>>>>>>>>> a next token to determine if iteration is complete).  This enables
>>>>>>>>> server-side state if it is desired, but also makes deterministic
>>>>>>>>> listing much more feasible (deterministic responses are essentially
>>>>>>>>> impossible in the face of changing data if only a start offset is
>>>>>>>>> provided).
>>>>>>>>>
>>>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>>>> responses being returned, with the last part containing a token if
>>>>>>>>> continuation is necessary.  Given the conversation on the other
>>>>>>>>> thread about streaming, I'd imagine this is quite hard to model in
>>>>>>>>> an OpenAPI REST service.
>>>>>>>>>
>>>>>>>>> Therefore it seems like using pagination with a token and limit
>>>>>>>>> would be preferred.  If skipping to someplace in the middle of the
>>>>>>>>> namespaces is required, then I would suggest modelling that as a
>>>>>>>>> first-class query parameter (e.g. "startAfterNamespace").
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 for this approach
>>>>>>>>>>
>>>>>>>>>> I think it's good to use query params because it can be
>>>>>>>>>> backward-compatible with the current behavior. If you get more than
>>>>>>>>>> the limit back, then the service probably doesn't support
>>>>>>>>>> pagination. And if a client doesn't support pagination, they get the
>>>>>>>>>> same results that they would today. A streaming approach with a
>>>>>>>>>> continuation link like in the scan API discussion wouldn't work
>>>>>>>>>> because old clients don't know to make a second request.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>>>>>>> touched on the topic of pagination when the REST response is large
>>>>>>>>>>> or takes time to be produced.
>>>>>>>>>>>
>>>>>>>>>>> I just want to discuss this separately, since we also see the
>>>>>>>>>>> issue for ListNamespaces and ListTables/Views, when integrating 
>>>>>>>>>>> with a
>>>>>>>>>>> large organization that has over 100k namespaces, and also a lot of 
>>>>>>>>>>> tables
>>>>>>>>>>> in some namespaces.
>>>>>>>>>>>
>>>>>>>>>>> Pagination requires either keeping state, or the response to be
>>>>>>>>>>> deterministic such that the client can request a range of the full
>>>>>>>>>>> response. If we want to avoid keeping state, I think we need to 
>>>>>>>>>>> allow some
>>>>>>>>>>> query parameters like:
>>>>>>>>>>> - *start*: the index of the first item to return in the response
>>>>>>>>>>> - *limit*: the maximum number of items to be returned in the
>>>>>>>>>>> response
>>>>>>>>>>>
>>>>>>>>>>> So we can send a request like:
>>>>>>>>>>>
>>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>>
>>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>>
>>>>>>>>>>> And the REST spec should enforce that the response returned for
>>>>>>>>>>> the paginated GET should be deterministic.
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
