> You can parallelize with opaque tokens by sending a starting point for the next request.

I meant we would have to wait for the server to return this starting point from the previous request? With start/limit each client can query for its own chunk without coordination.

On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:

> > I think start and offset have the advantage of being parallelizable (as compared to continuation tokens).
>
> You can parallelize with opaque tokens by sending a starting point for the next request.
>
> > On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case
>
> I don't think that we want to add `asOf`. If the service chooses to do this, it would send a continuation token that has the information embedded.
>
> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>
>> Can we assume it is the responsibility of the server to ensure determinism (e.g., by caching the results along with a query ID)? I think start and offset have the advantage of being parallelizable (as compared to continuation tokens). On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case (because it allows querying the warehouse as of any point in time, not just now).
>>
>> Thanks,
>> Walaa.
>>
>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I think you can solve the atomicity problem with a continuation token and server-side state. In general, I don't think this is a problem we should worry about a lot since pagination commonly has this problem. But since we can build a system that allows you to solve it if you choose to, we should go with that design.
>>>
>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> Hi Jack,
>>>> Some answers inline.
>>>>
>>>>> In addition to the start index approach, another potential simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order.
>>>>
>>>> I think this is one viable implementation, but the reason that the token should be opaque is that it allows several different implementations without client-side changes.
>>>>
>>>>> For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element.
>>>>
>>>> IMO, I think this is fine; for some of the REST APIs it is likely important to put constraints on atomicity requirements, while for others (e.g. list namespaces) I think it is OK to have looser requirements.
>>>>
>>>>> If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed.
>>>>
>>>> Time travel might be useful in some cases but I think it is orthogonal to services wishing to have guarantees around atomicity/consistency of results. If a server wants to ensure that results are atomic/consistent as of the start of the listing, it can embed the necessary timestamp in the token it returns and parse it out when fetching the next result.
>>>>
>>>> I think this does raise a more general point around service definition evolution. I think there likely need to be metadata endpoints that expose either:
>>>> 1. A version of the REST API supported.
>>>> 2. Features the API supports (e.g. which query parameters are honored for a specific endpoint).
>>>>
>>>> There are pros and cons to both approaches (apologies if I missed this in the spec or if it has already been discussed).
>>>>
>>>> Cheers,
>>>> Micah
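
For illustration, a minimal sketch of the token shape Micah describes above: the server packs a snapshot timestamp (and, here, the last returned name) into an opaque string and parses it back out on the next request. The JSON/base64 encoding and the field names are assumptions made for the sketch, not anything defined in the REST spec.

    # Illustrative only: one way a server could embed listing state in an opaque token.
    # The field names and the base64/JSON encoding are assumptions, not the REST spec.
    import base64
    import json
    import time

    def encode_token(last_name: str, as_of_ms: int) -> str:
        """Pack the last returned item and a snapshot timestamp into an opaque string."""
        payload = {"last": last_name, "asOfMs": as_of_ms}
        return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

    def decode_token(token: str) -> dict:
        """Recover the state embedded in a previously issued token."""
        return json.loads(base64.urlsafe_b64decode(token.encode()))

    # First page: the server fixes the snapshot time and returns it inside the token.
    token = encode_token(last_name="ns_0100", as_of_ms=int(time.time() * 1000))
    # Next page: the server parses the token and lists as of the same timestamp.
    state = decode_token(token)
    print(state["last"], state["asOfMs"])

Because the token is opaque to clients, a server can later change what it packs in (an index, a name, a timestamp, a query ID) without any client-side change, which is the point about opaqueness made above.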
>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Yes, I agree that it is better to not enforce the implementation to favor any direction, and a continuation token is probably better than enforcing a numeric start index.
>>>>>
>>>>> In addition to the start index approach, another potential simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order. Compared to the start index approach, it does not need to worry about the change of start index when something in the list is added or removed.
>>>>>
>>>>> However, the issue of concurrent modification could still exist even with a continuation token. For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element. If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed. (This is also the missing piece I forgot to mention in the start index approach to ensure it works in distributed settings.)
>>>>>
>>>>> -Jack
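
A rough sketch of the last-item-name variant Jack describes, assuming the listing is kept in lexicographic order; the helper and its signature are hypothetical, and an empty string stands in for "no more pages".

    # Resuming a lexicographically ordered listing from the last returned name.
    # The "token" is simply that name, one of the implementations an opaque token allows.
    from bisect import bisect_right

    def list_page(all_names, last_seen="", limit=100):
        """Return the next page after last_seen plus the token for the following page."""
        names = sorted(all_names)
        start = bisect_right(names, last_seen)  # "" sorts before every name, so page 1 starts at 0
        page = names[start:start + limit]
        next_token = page[-1] if start + limit < len(names) else ""
        return page, next_token

    namespaces = ["ns_%04d" % i for i in range(250)]
    page, token = list_page(namespaces, limit=100)
    while token:  # note: an item added before `token` between calls would be skipped
        page, token = list_page(namespaces, last_seen=token, limit=100)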
>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> I tried to cover these in more detail at:
>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>
>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 for this approach. I agree that the streaming approach requires that HTTP clients and servers have HTTP/2 streaming support, which is not compatible with old clients.
>>>>>>>
>>>>>>> I share the same concern as Micah that start/limit alone may not be enough in a distributed environment where modification happens during iteration. For compatibility, we need to consider several cases:
>>>>>>> 1. Old client <-> New server
>>>>>>> 2. New client <-> Old server
>>>>>>>
>>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>
>>>>>>>> I agree that we want to include this feature, and I raised similar concerns to what Micah already presented in talking with Ryan.
>>>>>>>>
>>>>>>>> For backward compatibility, just adding a start and limit implies a deterministic order, which is not a current requirement of the REST spec.
>>>>>>>>
>>>>>>>> Also, we need to consider whether the start/limit would need to be respected by the server. If existing implementations simply return all the results, will that be sufficient? There are a few edge cases that need to be considered here.
>>>>>>>>
>>>>>>>> For the opaque key approach, I think adding a query param to trigger/continue pagination and introducing a continuation token in the ListNamespacesResponse might allow for more backward compatibility. In that scenario, pagination would only take place for clients who know how to paginate, and the ordering would not need to be deterministic.
>>>>>>>>
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>>
>>>>>>>>> The behavior with no additional parameters requires the operations to happen as they do today for backwards compatibility (i.e. either all responses are returned or a failure occurs).
>>>>>>>>>
>>>>>>>>> For new parameters, I'd suggest an opaque start token (instead of a specific numeric offset) that can be returned by the service, and a limit (as proposed above). If a start token is provided without a limit, a default limit can be chosen by the server. Servers might return less than limit (i.e. clients are required to check for a next token to determine if iteration is complete). This enables server-side state if it is desired, but also makes deterministic listing much more feasible (deterministic responses are essentially impossible in the face of changing data if only a start offset is provided).
>>>>>>>>>
>>>>>>>>> In an ideal world, specifying a limit would result in streaming responses being returned, with the last part containing a token if continuation is necessary. Given the conversation on the other thread about streaming, I'd imagine this is quite hard to model in an OpenAPI REST service.
>>>>>>>>>
>>>>>>>>> Therefore it seems like using pagination with a token and offset would be preferred. If skipping someplace in the middle of the namespaces is required, then I would suggest modelling that as first-class query parameters (e.g. "startAfterNamespace").
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Micah
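
A rough client-side sketch of the token-plus-limit contract described above: keep following the continuation token until the server stops returning one. The endpoint path, the "limit"/"pageToken" parameters, and the "next-token" response field are placeholders, not the actual REST spec.

    # Hypothetical client loop for opaque-token pagination; parameter and field
    # names are placeholders, not the REST catalog spec.
    import requests

    def list_namespaces(base_url, limit=100):
        """Yield namespaces, following continuation tokens until none is returned."""
        params = {"limit": limit}
        while True:
            resp = requests.get(f"{base_url}/v1/namespaces", params=params)
            resp.raise_for_status()
            body = resp.json()
            yield from body.get("namespaces", [])
            token = body.get("next-token")
            if not token:  # servers may return fewer than limit, so only a missing token ends iteration
                break
            params = {"limit": limit, "pageToken": token}

An old server that ignores both parameters would return the full listing with no token, so the loop still terminates after a single request, which matches the backward-compatibility concern raised above.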
>>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> +1 for this approach
>>>>>>>>>>
>>>>>>>>>> I think it's good to use query params because it can be backward-compatible with the current behavior. If you get more than the limit back, then the service probably doesn't support pagination. And if a client doesn't support pagination, they get the same results that they would today. A streaming approach with a continuation link like in the scan API discussion wouldn't work because old clients don't know to make a second request.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> During the conversation about the Scan API for the REST spec, we touched on the topic of pagination, for when the REST response is large or takes time to be produced.
>>>>>>>>>>>
>>>>>>>>>>> I just want to discuss this separately, since we also see the issue for ListNamespaces and ListTables/Views when integrating with a large organization that has over 100k namespaces, and also a lot of tables in some namespaces.
>>>>>>>>>>>
>>>>>>>>>>> Pagination requires either keeping state, or the response to be deterministic such that the client can request a range of the full response. If we want to avoid keeping state, I think we need to allow some query parameters like:
>>>>>>>>>>> - *start*: the start index of the item in the response
>>>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>>>
>>>>>>>>>>> So we can send a request like:
>>>>>>>>>>>
>>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>>
>>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>>
>>>>>>>>>>> And the REST spec should enforce that the response returned for the paginated GET is deterministic.
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>
> --
> Ryan Blue
> Tabular
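
To illustrate the coordination-free parallelism mentioned at the top of the thread for the start/limit proposal quoted above: each worker can request its own fixed range without waiting for any previous response. The endpoint, the parameter names, and the assumption of a known total count and a deterministic server-side order are all illustrative, not part of the spec.

    # Sketch of parallel listing with start/limit: every chunk can be requested up
    # front. With an opaque continuation token, each request would instead have to
    # wait for the previous response to supply the next token.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    def fetch_chunk(base_url, start, limit):
        resp = requests.get(f"{base_url}/v1/namespaces",
                            params={"start": start, "limit": limit})
        resp.raise_for_status()
        return resp.json().get("namespaces", [])

    def fetch_all(base_url, total, limit=100):
        starts = range(0, total, limit)
        with ThreadPoolExecutor(max_workers=8) as pool:
            chunks = pool.map(lambda s: fetch_chunk(base_url, s, limit), starts)
        return [name for chunk in chunks for name in chunk]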