After looking around, it seems that, compared to OpenAPI, the AsyncAPI specification (https://www.asyncapi.com/) could be a better option for describing streaming APIs. That might be one potential option; just putting it out there.
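For illustration only, here is a rough sketch of what a client could look like if the scan API were modeled as an AsyncAPI channel over WebSockets. The URI, scan ID, and message shape are all assumptions, not part of any current spec:

```python
# Hypothetical: if scan tasks were exposed as an AsyncAPI-described channel
# (e.g. over WebSockets), a non-JVM client could consume tasks as the server
# plans them. Nothing below exists in the current REST spec.
import asyncio
import json

import websockets  # third-party: pip install websockets


async def consume_scan_tasks(scan_id: str) -> None:
    uri = f"wss://catalog.example.com/v1/scans/{scan_id}/tasks"  # made-up route
    async with websockets.connect(uri) as ws:
        async for message in ws:  # one scan task per message
            task = json.loads(message)
            print(task)


asyncio.run(consume_scan_tasks("scan-123"))
```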
-Jack

On Wed, Dec 13, 2023 at 11:52 AM Jack Ye <yezhao...@gmail.com> wrote:

The current proposal definitely makes the server stateful. In our prototype we used other components like DynamoDB to keep track of state. If keeping it stateless is a tenet, we can definitely move the proposal closer to that direction. Maybe one thing to make sure of is: is this a core tenet of the REST spec? Today we do not even have an official reference implementation of the REST server, so I feel it is hard to say what the core tenets are. Maybe we should create one?

Pagination is a common issue in the REST spec. We also see similar limitations with other APIs like GetTables and GetNamespaces: when a catalog has many namespaces and tables, it suffers from the same issue. It is also not ideal for use cases like web browsers, since you typically display a small page of results and do not need the full list immediately. So I feel we cannot really avoid keeping some state for those use cases.

A chunked response might be a good way to work around it. We also thought about using HTTP/2. However, these options do not seem very compatible with OpenAPI. We can do some further research in this domain; I would really appreciate it if anyone with more insight and experience with OpenAPI can provide suggestions.

-Jack

On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

Hi Rahil and Jack,

Thanks for raising this.

My question is that pagination and sharding will make the REST server stateful, i.e. a sequence of calls is required to go to the same server. In this case, how do we ensure the scalability of the REST server?

On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> wrote:

Hey Rahil and Jack,

Thanks for bringing this up. Ryan and I also discussed this briefly in the early days of PyIceberg, and it would have helped a lot with the speed of development. We went for the traditional approach because that would also support all the other catalogs, but now that the REST catalog is taking off, I think it still makes a lot of sense to get it in.

I do share the concern raised by Ryan around the concepts of shards and pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live in a single process today, the concept of shards doesn't add value. I see your concern with long-running jobs, but for the non-distributed cases it will add complexity.

Some suggestions that come to mind:

- Stream the tasks directly back using a chunked response, reducing the latency to the first task. This would also solve the pagination problem. The only downside I can think of is delete files, where you first need to make sure there are deletes relevant to the task; this might increase latency to the first task. (A rough client-side sketch follows at the end of this message.)
- Make the sharding optional. If you want to shard, you call CreateScan first and then call GetScanTasks with the IDs. If you don't want to shard, you omit the shard parameter and fetch the tasks directly (here we also need to replace the scan string with the full column/expression/snapshot-id, etc.).

Looking forward to discussing this tomorrow in the community sync <https://iceberg.apache.org/community/#iceberg-community-events>!
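To make the first suggestion a bit more concrete, here is a minimal client-side sketch. The route and the NDJSON encoding (one JSON task per line) are assumptions for illustration, not part of the current REST spec:

```python
# A minimal sketch of the chunked-response idea: the server streams one
# JSON-encoded scan task per line, and the client acts on tasks as they
# arrive instead of waiting for the full plan. Route and payload shape
# are hypothetical.
import json

import requests


def stream_scan_tasks(base_url: str, namespace: str, table: str, filter_expr: str):
    url = f"{base_url}/v1/namespaces/{namespace}/tables/{table}/scan-tasks"  # made-up route
    with requests.post(url, json={"filter": filter_expr}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():  # yields lines as chunks arrive
            if line:
                yield json.loads(line)  # first task is usable immediately


for task in stream_scan_tasks("https://catalog.example.com", "db", "events", "id > 100"):
    print(task)
```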
Kind regards,
Fokko

On Mon, Dec 11, 2023 at 19:05, Jack Ye <yezhao...@gmail.com> wrote:

Hi Ryan, thanks for the feedback!

I was part of this design discussion internally and can provide more details. One reason for separating the CreateScan operation was to make the API asynchronous and thus keep HTTP communications short. Consider the case where we only have the GetScanTasks API and no shard is specified. It might take tens of seconds, or even minutes, to read through the manifest list and all manifests before being able to return anything. This means the HTTP connection has to remain open during that period, which is not good practice in general (consider connection failures, load balancer and proxy load, etc.). And when we shift the API to asynchronous, it basically becomes something like the proposal, where a stateful ID is generated so the server can return to the client immediately, and the client gets results by referencing the ID. So in our current prototype implementation we are actually keeping this ID, and the whole REST service is stateful.

We also had some thoughts about the possibility of defining a "shard ID generator" protocol: the client agrees with the service on a way to deterministically generate shard IDs, and the service uses it to create shards. That sounds like what you are suggesting here, and it pushes the responsibility for determining the parallelism to the client side. But in some bad cases (e.g. there are many delete files and we need to read all of them in each shard to apply filters), it seems like there might still be the long-open-connection issue above. What is your thought on that?

-Jack

On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote:

Rahil, thanks for working on this. It has some really good ideas that we hadn't considered before, like a way for the service to plan how to break up the work of scan planning. I really like that idea because it makes it much easier for the service to keep memory consumption low across requests.

My primary feedback is that I think it's a little too complicated (with both sharding and pagination) and could be modified slightly so that the service doesn't need to be stateful. If the service isn't necessarily stateful, it should be easier to build implementations.

To make it possible for the service to be stateless, I'm proposing that rather than creating shard IDs that are tracked by the service, the information for a shard can be sent to the client. My assumption here is that most implementations would create shards by reading the manifest list, filtering on partition ranges, and creating a shard for some reasonable size of manifest content. For example, if a table has 100MB of metadata in 25 manifests that are about 4MB each, then it might create 9 shards with 1-4 manifests each. The service could send those shards to the client as a list of manifests to read, and the client could send the shard information back to the service to get the data files in each shard (along with the original filter).

There's a slight trade-off in that the protocol needs to define how to break the work into shards.
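As a rough sketch of that round trip (all route and field names below are hypothetical):

```python
# Stateless shard round trip: the plan response enumerates shards as lists
# of manifests, and the client echoes a shard back, together with the
# original filter, to get that shard's file tasks. Because each shard is
# self-describing, any server replica can answer, and nothing is persisted.
import requests

BASE = "https://catalog.example.com/v1/namespaces/db/tables/events"  # made up
FILTER = "event_date >= '2023-12-01'"

# Step 1: the service reads the manifest list and groups manifests into shards.
plan = requests.post(f"{BASE}/plan", json={"filter": FILTER}).json()
# e.g. plan == {"shards": [{"manifests": ["m1.avro", "m2.avro"]}, ...]}

# Step 2: the client sends each shard's manifest list back to get data files.
for shard in plan["shards"]:
    tasks = requests.post(
        f"{BASE}/tasks",
        json={"filter": FILTER, "manifests": shard["manifests"]},
    ).json()
    for task in tasks["file-scan-tasks"]:
        print(task)
```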
I'm interested in hearing whether that would work with how you were planning on building the service on your end. Another option is to let the service send back arbitrary JSON that would get returned for each shard. Either way, I like that this would make it so the service doesn't need to persist anything. We could also make it so that small tables don't require multiple requests; for example, a client could call the route to get file tasks with just a filter.

What do you think?

Ryan

On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil <rcher...@amazon.com.invalid> wrote:

Hi all,

My name is Rahil Chertara, and I'm part of the Iceberg team at Amazon EMR and Athena. I'm reaching out to share a proposal for a new Scan API that will be utilized by the RESTCatalog. Table scan planning is currently done within client engines such as Apache Spark. By moving scan functionality to the RESTCatalog, we can integrate Iceberg table scans with external services, which can lead to several benefits.

For example, we can leverage caching and indexes on the server side to improve planning performance. Furthermore, by moving this scan logic to the RESTCatalog, non-JVM engines can integrate more easily. This can all be found in the detailed proposal below. Feel free to comment and add your suggestions.

Detailed proposal: https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h

Github POC: https://github.com/apache/iceberg/pull/9252

Regards,

Rahil Chertara
Amazon EMR & Athena
rcher...@amazon.com

--
Ryan Blue
Tabular