After looking around, it seems that, compared to OpenAPI, the AsyncAPI specification (https://www.asyncapi.com/) could be a better option for describing streaming APIs. That might be one potential option; just putting it out there.
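For illustration only, here is a rough sketch of what a client could look like if the scan API were modeled as an AsyncAPI channel over WebSockets. The URI, scan ID, and message shape are all assumptions, not part of any current spec:

```python
# Hypothetical: if scan tasks were exposed as an AsyncAPI-described channel
# (e.g. over WebSockets), a non-JVM client could consume tasks as the server
# plans them. Nothing below exists in the current REST spec.
import asyncio
import json

import websockets  # third-party: pip install websockets


async def consume_scan_tasks(scan_id: str) -> None:
    uri = f"wss://catalog.example.com/v1/scans/{scan_id}/tasks"  # made-up route
    async with websockets.connect(uri) as ws:
        async for message in ws:  # one scan task per message
            task = json.loads(message)
            print(task)


asyncio.run(consume_scan_tasks("scan-123"))
```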
-Jack

On Wed, Dec 13, 2023 at 11:52 AM Jack Ye <yezhao...@gmail.com> wrote:

The current proposal definitely makes the server stateful. In our prototype we used other components like DynamoDB to keep track of state. If keeping it stateless is a tenet, we can definitely move the proposal closer to that direction. Maybe one thing to make sure of is: is this a core tenet of the REST spec? Today we do not even have an official reference implementation of the REST server, so I feel it is hard to say what the core tenets are. Maybe we should create one?

Pagination is a common issue in the REST spec. We also see similar limitations with other APIs like GetTables and GetNamespaces: when a catalog has many namespaces and tables, it suffers from the same issue. It is also not ideal for use cases like web browsers, since you typically display a small page of results and do not need the full list immediately. So I feel we cannot really avoid keeping some state for those use cases.

A chunked response might be a good way to work around it. We also thought about using HTTP/2. However, these options do not seem very compatible with OpenAPI. We can do some further research in this domain; I would really appreciate it if anyone with more insight and experience with OpenAPI can provide suggestions.

-Jack

On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

Hi Rahil and Jack,

Thanks for raising this.

My question is that pagination and sharding will make the REST server stateful, i.e. a sequence of calls is required to go to the same server. In this case, how do we ensure the scalability of the REST server?

On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> wrote:

Hey Rahil and Jack,

Thanks for bringing this up. Ryan and I also discussed this briefly in the early days of PyIceberg, and it would have helped a lot with the speed of development. We went for the traditional approach because that would also support all the other catalogs, but now that the REST catalog is taking off, I think it still makes a lot of sense to get it in.

I do share the concern raised by Ryan around the concepts of shards and pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live in a single process today, the concept of shards doesn't add value. I see your concern with long-running jobs, but for the non-distributed cases it will add complexity.

Some suggestions that come to mind:

- Stream the tasks directly back using a chunked response, reducing the latency to the first task. This would also solve the pagination problem. The only downside I can think of is delete files, where you first need to make sure there are deletes relevant to the task; this might increase latency to the first task. (A rough client-side sketch follows at the end of this message.)
- Make the sharding optional. If you want to shard, you call CreateScan first and then call GetScanTasks with the IDs. If you don't want to shard, you omit the shard parameter and fetch the tasks directly (here we also need to replace the scan string with the full column/expression/snapshot-id, etc.).

Looking forward to discussing this tomorrow in the community sync <https://iceberg.apache.org/community/#iceberg-community-events>!
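To make the first suggestion a bit more concrete, here is a minimal client-side sketch. The route and the NDJSON encoding (one JSON task per line) are assumptions for illustration, not part of the current REST spec:

```python
# A minimal sketch of the chunked-response idea: the server streams one
# JSON-encoded scan task per line, and the client acts on tasks as they
# arrive instead of waiting for the full plan. Route and payload shape
# are hypothetical.
import json

import requests


def stream_scan_tasks(base_url: str, namespace: str, table: str, filter_expr: str):
    url = f"{base_url}/v1/namespaces/{namespace}/tables/{table}/scan-tasks"  # made-up route
    with requests.post(url, json={"filter": filter_expr}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():  # yields lines as chunks arrive
            if line:
                yield json.loads(line)  # first task is usable immediately


for task in stream_scan_tasks("https://catalog.example.com", "db", "events", "id > 100"):
    print(task)
```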
Kind regards,
Fokko

On Mon, Dec 11, 2023 at 19:05, Jack Ye <yezhao...@gmail.com> wrote:

Hi Ryan, thanks for the feedback!

I was part of this design discussion internally and can provide more details. One reason for separating the CreateScan operation was to make the API asynchronous and thus keep HTTP communications short. Consider the case where we only have the GetScanTasks API and no shard is specified. It might take tens of seconds, or even minutes, to read through the manifest list and all manifests before being able to return anything. This means the HTTP connection has to remain open during that period, which is not good practice in general (consider connection failures, load balancer and proxy load, etc.). And when we shift the API to asynchronous, it basically becomes something like the proposal, where a stateful ID is generated so the server can return to the client immediately, and the client gets results by referencing the ID. So in our current prototype implementation we are actually keeping this ID, and the whole REST service is stateful.

We also had some thoughts about the possibility of defining a "shard ID generator" protocol: the client agrees with the service on a way to deterministically generate shard IDs, and the service uses it to create shards. That sounds like what you are suggesting here, and it pushes the responsibility for determining the parallelism to the client side. But in some bad cases (e.g. there are many delete files and we need to read all of them in each shard to apply filters), it seems like there might still be the long-open-connection issue above. What is your thought on that?

-Jack

On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote:

Rahil, thanks for working on this. It has some really good ideas that we hadn't considered before, like a way for the service to plan how to break up the work of scan planning. I really like that idea because it makes it much easier for the service to keep memory consumption low across requests.

My primary feedback is that I think it's a little too complicated (with both sharding and pagination) and could be modified slightly so that the service doesn't need to be stateful. If the service isn't necessarily stateful, it should be easier to build implementations.

To make it possible for the service to be stateless, I'm proposing that rather than creating shard IDs that are tracked by the service, the information for a shard can be sent to the client. My assumption here is that most implementations would create shards by reading the manifest list, filtering on partition ranges, and creating a shard for some reasonable size of manifest content. For example, if a table has 100MB of metadata in 25 manifests that are about 4MB each, then it might create 9 shards with 1-4 manifests each. The service could send those shards to the client as a list of manifests to read, and the client could send the shard information back to the service to get the data files in each shard (along with the original filter).

There's a slight trade-off in that the protocol needs to define how to break the work into shards.
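As a rough sketch of that round trip (all route and field names below are hypothetical):

```python
# Stateless shard round trip: the plan response enumerates shards as lists
# of manifests, and the client echoes a shard back, together with the
# original filter, to get that shard's file tasks. Because each shard is
# self-describing, any server replica can answer, and nothing is persisted.
import requests

BASE = "https://catalog.example.com/v1/namespaces/db/tables/events"  # made up
FILTER = "event_date >= '2023-12-01'"

# Step 1: the service reads the manifest list and groups manifests into shards.
plan = requests.post(f"{BASE}/plan", json={"filter": FILTER}).json()
# e.g. plan == {"shards": [{"manifests": ["m1.avro", "m2.avro"]}, ...]}

# Step 2: the client sends each shard's manifest list back to get data files.
for shard in plan["shards"]:
    tasks = requests.post(
        f"{BASE}/tasks",
        json={"filter": FILTER, "manifests": shard["manifests"]},
    ).json()
    for task in tasks["file-scan-tasks"]:
        print(task)
```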
I'm interested in hearing whether that would work with how you were planning on building the service on your end. Another option is to let the service send back arbitrary JSON that would get returned for each shard. Either way, I like that this would make it so the service doesn't need to persist anything. We could also make it so that small tables don't require multiple requests; for example, a client could call the route to get file tasks with just a filter.

What do you think?

Ryan

On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil <rcher...@amazon.com.invalid> wrote:

Hi all,

My name is Rahil Chertara, and I'm part of the Iceberg team at Amazon EMR and Athena. I'm reaching out to share a proposal for a new Scan API that will be utilized by the RESTCatalog. Table scan planning is currently done within client engines such as Apache Spark. By moving scan functionality to the RESTCatalog, we can integrate Iceberg table scans with external services, which can lead to several benefits.

For example, we can leverage caching and indexes on the server side to improve planning performance. Furthermore, by moving this scan logic to the RESTCatalog, non-JVM engines can integrate more easily. This can all be found in the detailed proposal below. Feel free to comment and add your suggestions.

Detailed proposal: https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h

Github POC: https://github.com/apache/iceberg/pull/9252

Regards,

Rahil Chertara
Amazon EMR & Athena
rcher...@amazon.com

--
Ryan Blue
Tabular