About the pagination part, I did some investigation and found that openapi doesn't have spec about streaming responses, but it's actually implementation detail. There are several ways to implement json streaming <https://en.wikipedia.org/wiki/JSON_streaming> , and also there is an rfc <https://www.rfc-editor.org/rfc/rfc7464.html> for json-seq mime type, but it seems not supported widely.
In practice kubernetes has used streaming to implement watch api <https://kubernetes.io/docs/reference/using-api/api-concepts/#streaming-lists> . I think we can choose one of the approaches to avoid pagination. Also it's worth noticing that the http client needs to support http streaming, but I think it would not be a block nowadays. On Thu, Dec 14, 2023 at 8:39 AM Jack Ye <yezhao...@gmail.com> wrote: > Seems like that track has expired (This Internet-Draft will expire on 13 > May 2022), not sure how these RFCs are managed, but it does not seem > hopeful to have this verb in. I think people are mostly using POST for this > use case already. > > But overall I think we are in agreement with the general direction. A few > detail discussions: > > *Distinguish planning using shard or not* > Maybe we should add a query parameter like *distributed=true* to > distinguish your first and third case, since they are now sharing the same > signature. If the requester wants to use distributed planning, then some > sharding strategy is provided as a response for the requester to send more > detailed requests. > > *Necessity of scan ID* > In this approach, is scan ID still required? Because the shard payload > already fully describes the information to retrieve, it seems like we can > just drop the *scan-id* query parameter in the second case. Seems like > it's kept for the case if we still want to persist some state, but it seems > like we can make a stateless style fully working. > > *Shape of shard payload* > What do you think is necessary information of the shard payload? It seems > like we need at least the location of the manifests, plus the delete > manifests or delete files associated with the manifests. I like the idea of > making it a "shard task" that is similar to a file scan task, and it might > allow us to return a mixture of both types of tasks, so we can have better > control of the response size. > > -Jack > > On Wed, Dec 13, 2023 at 3:50 PM Ryan Blue <b...@tabular.io> wrote: > >> I just changed it to POST after looking into support for the QUERY >> method. It's a new HTTP method for cases like this where you don't want to >> pass everything through query params. Here's the QUERY method RFC >> <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>, >> but I guess it isn't finalized yet? >> >> Just read them like you would a POST request that doesn't actually create >> anything. >> >> On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <yezhao...@gmail.com> wrote: >> >>> Thanks, the Gist explains a lot of things. This is actually very close >>> to our way of implementing the shard ID, we were defining the shard ID as a >>> string, and the string content is actually something similar to the >>> information of the JSON payload you showed, so we can persist minimum >>> information in storage. >>> >>> Just one clarification needed for your Gist: >>> >>> > QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1 >>> > { "shard": { "id": 1, "manifests": ["C"] }, "filter": {"type": "in", >>> "term": "x", "values": [1, 2, 3] } } >>> > >>> > { "file-scan-tasks": [...] } >>> >>> Here, what does this QUERY verb mean? Is that a GET? If it's GET, we >>> cannot have a request body. That's actually why we expressed that as an ID >>> string, since we can put it as a query parameter. >>> >>> -Jack >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <b...@tabular.io> wrote: >>> >>>> Jack, >>>> >>>> It sounds like what I’m proposing isn’t quite clear because your >>>> initial response was arguing for a sharding capability. I agree that >>>> sharding is a good idea. I’m less confident about two points: >>>> >>>> 1. Requiring that the service is stateful. As Renjie pointed out, >>>> that makes it harder to scale the service. >>>> 2. The need for both pagination *and* sharding as separate things >>>> >>>> And I also think that Fokko has a good point about trying to keep >>>> things simple and not requiring the CreateScan endpoint. >>>> >>>> For the first point, I’m proposing that we still have a CreateScan >>>> endpoint, but instead of sending only a list of shard IDs it can also send >>>> either a standard shard “task” or an optional JSON definition. Let’s assume >>>> we can send arbitrary JSON for an example. Say I have a table with 4 >>>> manifests, A through D and that C and D match some query filter. When >>>> I call the CreateScan endpoint, the service would send back tasks with >>>> that information: {"id": 1, "manifests": ["C"]}, {"id": 2, >>>> "manifests": ["D"]}. By sending what the shards mean (the manifests to >>>> read), my service can be stateless: any node can get a request for shard 1, >>>> read manifest C, and send back the resulting data files. >>>> >>>> I don’t see much of an argument against doing this *in principle*. It >>>> gives you the flexibility to store state if you choose or to send state to >>>> the client for it to pass back when calling the GetTasks endpoint. >>>> There is a practical problem, which is that it’s annoying to send a GET >>>> request with a JSON payload because you can’t send a request body. It’s >>>> probably obvious, but I’m also not a REST purist so I’d be fine using POST >>>> or QUERY for this. It would look something like this Gist >>>> <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>. >>>> >>>> In your last reply, you also asked whether a stateless service is a >>>> goal. I don’t think that it is, but if we can make simple changes to the >>>> spec to allow more flexibility on the server side, I think that’s a good >>>> direction. You also asked about a reference implementation and I consider >>>> CatalogHandlers >>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java> >>>> to be that reference. It does everything except for the work done by your >>>> choice of web application framework. It isn’t stateless, but it only relies >>>> on a Catalog implementation for persistence. >>>> >>>> For the second point, I don’t understand why we need both sharding and >>>> pagination. That is, if we have a protocol that allows sharding, why is >>>> pagination also needed? From my naive perspective on how sharding would >>>> work, we should be able to use metadata from the manifest list to limit the >>>> potential number of data files in a given shard. As long as we can limit >>>> the size of a shard to produce more, pagination seems like unnecessary >>>> complication. >>>> >>>> Lastly, for Fokko’s point, I think another easy extension to the >>>> proposal is to support a direct call to GetTasks. There’s a trade-off >>>> here, but if you’re already sending the original filter along with the >>>> request (in order to filter records from manifest C for instance) then >>>> the request is already something the protocol can express. There’s an >>>> objection concerning resource consumption on the service and creating >>>> responses that are too large or take too long, but we can get around that >>>> by responding with a code that instructs the client to use the >>>> CreateScan API like 413 (Payload too large). I think that would allow >>>> simple clients to function for all but really large tables. The gist above >>>> also shows what this might look like. >>>> >>>> Ryan >>>> >>>> On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <yezhao...@gmail.com> wrote: >>>> >>>>> The current proposal definitely makes the server stateful. In our >>>>> prototype we used other components like DynamoDB to keep track of states. >>>>> If keeping it stateless is a tenant we can definitely make the proposal >>>>> closer to that direction. Maybe one thing to make sure is, is this a core >>>>> tenant of the REST spec? Today we do not even have an official reference >>>>> implementation of the REST server, I feel it is hard to say what are the >>>>> core tenants. Maybe we should create one? >>>>> >>>>> Pagination is a common issue in the REST spec. We also see similar >>>>> limitations with other APIs like GetTables, GetNamespaces. When a catalog >>>>> has many namespaces and tables it suffers from the same issue. It is also >>>>> not ideal for use cases like web browsers, since typically you display a >>>>> small page of results and do not need the full list immediately. So I feel >>>>> we cannot really avoid some state to be kept for those use cases. >>>>> >>>>> Chunked response might be a good way to work around it. We also >>>>> thought about using HTTP2. However, these options seem to be not very >>>>> compatible with OpenAPI. We can do some further research in this domain, >>>>> would really appreciate it if anyone has more insights and experience with >>>>> OpenAPI that can provide some suggestions. >>>>> >>>>> -Jack >>>>> >>>>> >>>>> >>>>> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi, Rahi and Jack: >>>>>> >>>>>> Thanks for raising this. >>>>>> >>>>>> My question is that the pagination and sharding will make the rest >>>>>> server stateful, e.g. a sequence of calls is required to go to the same >>>>>> server. In this case, how do we ensure the scalability of the rest >>>>>> server? >>>>>> >>>>>> >>>>>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> Hey Rahil and Jack, >>>>>>> >>>>>>> Thanks for bringing this up. Ryan and I also discussed this briefly >>>>>>> in the early days of PyIceberg and it would have helped a lot in the >>>>>>> speed >>>>>>> of development. We went for the traditional approach because that would >>>>>>> also support all the other catalogs, but now that the REST catalog is >>>>>>> taking off, I think it still makes a lot of sense to get it in. >>>>>>> >>>>>>> I do share the concern raised Ryan around the concepts of shards and >>>>>>> pagination. For PyIceberg (but also for Go, Rust, and DuckDB) that are >>>>>>> living in a single process today the concept of shards doesn't add >>>>>>> value. I >>>>>>> see your concern with long-running jobs, but for the non-distributed >>>>>>> cases, >>>>>>> it will add additional complexity. >>>>>>> >>>>>>> Some suggestions that come to mind: >>>>>>> >>>>>>> - Stream the tasks directly back using a chunked >>>>>>> response, reducing the latency to the first task. This would also >>>>>>> solve >>>>>>> things with the pagination. The only downside I can think of is >>>>>>> having >>>>>>> delete files where you first need to make sure there are deletes >>>>>>> relevant >>>>>>> to the task, this might increase latency to the first task. >>>>>>> - Making the sharding optional. If you want to shard you call >>>>>>> the CreateScan first and then call the GetScanTask with the IDs. If >>>>>>> you >>>>>>> don't want to shard, you omit the shard parameter and fetch the tasks >>>>>>> directly (here we need also replace the scan string with the full >>>>>>> column/expression/snapshot-id etc). >>>>>>> >>>>>>> Looking forward to discussing this tomorrow in the community sync >>>>>>> <https://iceberg.apache.org/community/#iceberg-community-events>! >>>>>>> >>>>>>> Kind regards, >>>>>>> Fokko >>>>>>> >>>>>>> >>>>>>> >>>>>>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com>: >>>>>>> >>>>>>>> Hi Ryan, thanks for the feedback! >>>>>>>> >>>>>>>> I was a part of this design discussion internally and can provide >>>>>>>> more details. One reason for separating the CreateScan operation was to >>>>>>>> make the API asynchronous and thus keep HTTP communications short. >>>>>>>> Consider >>>>>>>> the case where we only have GetScanTasks API, and there is no shard >>>>>>>> specified. It might take tens of seconds, or even minutes to read >>>>>>>> through >>>>>>>> all the manifest list and manifests before being able to return >>>>>>>> anything. >>>>>>>> This means the HTTP connection has to remain open during that period, >>>>>>>> which >>>>>>>> is not really a good practice in general (consider connection failure, >>>>>>>> load >>>>>>>> balancer and proxy load, etc.). And when we shift the API to >>>>>>>> asynchronous, >>>>>>>> it basically becomes something like the proposal, where a stateful ID >>>>>>>> is >>>>>>>> generated to be able to immediately return back to the client, and the >>>>>>>> client get results by referencing the ID. So in our current prototype >>>>>>>> implementation we are actually keeping this ID and the whole REST >>>>>>>> service >>>>>>>> is stateful. >>>>>>>> >>>>>>>> There were some thoughts we had about the possibility to define a >>>>>>>> "shard ID generator" protocol: basically the client agrees with the >>>>>>>> service >>>>>>>> a way to deterministically generate shard IDs, and service uses it to >>>>>>>> create shards. That sounds like what you are suggesting here, and it >>>>>>>> pushes >>>>>>>> the responsibility to the client side to determine the parallelism. >>>>>>>> But in >>>>>>>> some bad cases (e.g. there are many delete files and we need to read >>>>>>>> all >>>>>>>> those in each shard to apply filters), it seems like there might still >>>>>>>> be >>>>>>>> the long open connection issue above. What is your thought on that? >>>>>>>> >>>>>>>> -Jack >>>>>>>> >>>>>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote: >>>>>>>> >>>>>>>>> Rahil, thanks for working on this. It has some really good ideas >>>>>>>>> that we hadn't considered before like a way for the service to plan >>>>>>>>> how to >>>>>>>>> break up the work of scan planning. I really like that idea because it >>>>>>>>> makes it much easier for the service to keep memory consumption low >>>>>>>>> across >>>>>>>>> requests. >>>>>>>>> >>>>>>>>> My primary feedback is that I think it's a little too complicated >>>>>>>>> (with both sharding and pagination) and could be modified slightly so >>>>>>>>> that >>>>>>>>> the service doesn't need to be stateful. If the service isn't >>>>>>>>> necessarily >>>>>>>>> stateful then it should be easier to build implementations. >>>>>>>>> >>>>>>>>> To make it possible for the service to be stateless, I'm proposing >>>>>>>>> that rather than creating shard IDs that are tracked by the service, >>>>>>>>> the >>>>>>>>> information for a shard can be sent to the client. My assumption here >>>>>>>>> is >>>>>>>>> that most implementations would create shards by reading the manifest >>>>>>>>> list, >>>>>>>>> filtering on partition ranges, and creating a shard for some >>>>>>>>> reasonable >>>>>>>>> size of manifest content. For example, if a table has 100MB of >>>>>>>>> metadata in >>>>>>>>> 25 manifests that are about 4 MB each, then it might create 9 shards >>>>>>>>> with >>>>>>>>> 1-4 manifests each. The service could send those shards to the client >>>>>>>>> as a >>>>>>>>> list of manifests to read and the client could send the shard >>>>>>>>> information >>>>>>>>> back to the service to get the data files in each shard (along with >>>>>>>>> the >>>>>>>>> original filter). >>>>>>>>> >>>>>>>>> There's a slight trade-off that the protocol needs to define how >>>>>>>>> to break the work into shards. I'm interested in hearing if that >>>>>>>>> would work >>>>>>>>> with how you were planning on building the service on your end. >>>>>>>>> Another >>>>>>>>> option is to let the service send back arbitrary JSON that would get >>>>>>>>> returned for each shard. Either way, I like that this would make it >>>>>>>>> so the >>>>>>>>> service doesn't need to persist anything. We could also make it so >>>>>>>>> that >>>>>>>>> small tables don't require multiple requests. For example, a client >>>>>>>>> could >>>>>>>>> call the route to get file tasks with just a filter. >>>>>>>>> >>>>>>>>> What do you think? >>>>>>>>> >>>>>>>>> Ryan >>>>>>>>> >>>>>>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil >>>>>>>>> <rcher...@amazon.com.invalid> wrote: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at >>>>>>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a >>>>>>>>>> new Scan >>>>>>>>>> API that will be utilized by the RESTCatalog. The process for table >>>>>>>>>> scan >>>>>>>>>> planning is currently done within client engines such as Apache >>>>>>>>>> Spark. By >>>>>>>>>> moving scan functionality to the RESTCatalog, we can integrate >>>>>>>>>> Iceberg >>>>>>>>>> table scans with external services, which can lead to several >>>>>>>>>> benefits. >>>>>>>>>> >>>>>>>>>> For example, we can leverage caching and indexes on the server >>>>>>>>>> side to improve planning performance. Furthermore, by moving this >>>>>>>>>> scan >>>>>>>>>> logic to the RESTCatalog, non-JVM engines can integrate more easily. >>>>>>>>>> This >>>>>>>>>> all can be found in the detailed proposal below. Feel free to >>>>>>>>>> comment, and >>>>>>>>>> add your suggestions . >>>>>>>>>> >>>>>>>>>> Detailed proposal: >>>>>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h >>>>>>>>>> >>>>>>>>>> Github POC: https://github.com/apache/iceberg/pull/9252 >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> >>>>>>>>>> Rahil Chertara >>>>>>>>>> Amazon EMR & Athena >>>>>>>>>> rcher...@amazon.com >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ryan Blue >>>>>>>>> Tabular >>>>>>>>> >>>>>>>> >>>> >>>> -- >>>> Ryan Blue >>>> Tabular >>>> >>> >> >> -- >> Ryan Blue >> Tabular >> >