I just changed it to POST after looking into support for the QUERY method. It's a proposed HTTP method for cases like this, where you don't want to pass everything through query params. Here's the QUERY method draft RFC <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>, but I guess it isn't finalized yet?
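In case it helps anyone prototyping this: Python's stdlib `http.client` doesn't restrict the method name, so a client can attempt QUERY and fall back to POST if the server rejects it. Here's a rough sketch; the path and payload shape follow the Gist's example, but the fallback idea and the helper names are mine, not anything in the proposal:

```python
import json


def build_scan_request(shard_id, manifests, filter_expr):
    """Build the JSON body for a scan request.

    The "shard"/"filter" field names follow the example payload in the
    Gist; they are illustrative, not a finalized spec.
    """
    return {
        "shard": {"id": shard_id, "manifests": manifests},
        "filter": filter_expr,
    }


def scan_request(conn, scan_id, body, method="QUERY"):
    """Send the scan request over an http.client.HTTPConnection.

    http.client passes the method string through verbatim, so QUERY works
    against servers that support it; callers can retry with method="POST"
    on a 405, since QUERY is still a draft.
    """
    conn.request(
        method,
        f"/v1/namespaces/ns/tables/t/scans?scan-id={scan_id}",
        body=json.dumps(body),
        headers={"Content-Type": "application/json"},
    )
    return conn.getresponse()


# Build the same request body as the Gist's QUERY example.
body = build_scan_request(1, ["C"], {"type": "in", "term": "x", "values": [1, 2, 3]})
print(json.dumps(body))
```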
Just read them like you would a POST request that doesn't actually create anything. On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <yezhao...@gmail.com> wrote: > Thanks, the Gist explains a lot of things. This is actually very close to > our way of implementing the shard ID, we were defining the shard ID as a > string, and the string content is actually something similar to the > information of the JSON payload you showed, so we can persist minimum > information in storage. > > Just one clarification needed for your Gist: > > > QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1 > > { "shard": { "id": 1, "manifests": ["C"] }, "filter": {"type": "in", > "term": "x", "values": [1, 2, 3] } } > > > > { "file-scan-tasks": [...] } > > Here, what does this QUERY verb mean? Is that a GET? If it's GET, we > cannot have a request body. That's actually why we expressed that as an ID > string, since we can put it as a query parameter. > > -Jack > > > > > > > > > > On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <b...@tabular.io> wrote: > >> Jack, >> >> It sounds like what I’m proposing isn’t quite clear because your initial >> response was arguing for a sharding capability. I agree that sharding is a >> good idea. I’m less confident about two points: >> >> 1. Requiring that the service is stateful. As Renjie pointed out, >> that makes it harder to scale the service. >> 2. The need for both pagination *and* sharding as separate things >> >> And I also think that Fokko has a good point about trying to keep things >> simple and not requiring the CreateScan endpoint. >> >> For the first point, I’m proposing that we still have a CreateScan >> endpoint, but instead of sending only a list of shard IDs it can also send >> either a standard shard “task” or an optional JSON definition. Let’s assume >> we can send arbitrary JSON for an example. Say I have a table with 4 >> manifests, A through D and that C and D match some query filter. 
When I >> call the CreateScan endpoint, the service would send back tasks with >> that information: {"id": 1, "manifests": ["C"]}, {"id": 2, "manifests": >> ["D"]}. By sending what the shards mean (the manifests to read), my >> service can be stateless: any node can get a request for shard 1, read >> manifest C, and send back the resulting data files. >> >> I don’t see much of an argument against doing this *in principle*. It >> gives you the flexibility to store state if you choose or to send state to >> the client for it to pass back when calling the GetTasks endpoint. There >> is a practical problem, which is that it’s annoying to send a GET request >> with a JSON payload because you can’t send a request body. It’s probably >> obvious, but I’m also not a REST purist so I’d be fine using POST or QUERY >> for this. It would look something like this Gist >> <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>. >> >> In your last reply, you also asked whether a stateless service is a goal. >> I don’t think that it is, but if we can make simple changes to the spec to >> allow more flexibility on the server side, I think that’s a good direction. >> You also asked about a reference implementation and I consider >> CatalogHandlers >> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java> >> to be that reference. It does everything except for the work done by your >> choice of web application framework. It isn’t stateless, but it only relies >> on a Catalog implementation for persistence. >> >> For the second point, I don’t understand why we need both sharding and >> pagination. That is, if we have a protocol that allows sharding, why is >> pagination also needed? From my naive perspective on how sharding would >> work, we should be able to use metadata from the manifest list to limit the >> potential number of data files in a given shard. 
As long as we can limit >> the size of a shard to produce more shards, pagination seems like an unnecessary >> complication. >> >> Lastly, for Fokko’s point, I think another easy extension to the proposal >> is to support a direct call to GetTasks. There’s a trade-off here, but >> if you’re already sending the original filter along with the request (in >> order to filter records from manifest C, for instance) then the request >> is already something the protocol can express. There’s an objection >> concerning resource consumption on the service and creating responses that >> are too large or take too long, but we can get around that by responding >> with a code that instructs the client to use the CreateScan API, like 413 >> (Payload Too Large). I think that would allow simple clients to function >> for all but really large tables. The gist above also shows what this might >> look like. >> >> Ryan >> >> On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <yezhao...@gmail.com> wrote: >> >>> The current proposal definitely makes the server stateful. In our >>> prototype we used other components like DynamoDB to keep track of state. >>> If keeping it stateless is a tenet, we can definitely move the proposal >>> closer to that direction. One thing to make sure of: is this a core >>> tenet of the REST spec? Today we do not even have an official reference >>> implementation of the REST server, so I feel it is hard to say what the >>> core tenets are. Maybe we should create one? >>> >>> Pagination is a common issue in the REST spec. We also see similar >>> limitations with other APIs like GetTables and GetNamespaces. When a catalog >>> has many namespaces and tables, it suffers from the same issue. It is also >>> not ideal for use cases like web browsers, since typically you display a >>> small page of results and do not need the full list immediately. So I feel >>> we cannot really avoid keeping some state for those use cases.
>>> >>> Chunked responses might be a good way to work around it. We also thought >>> about using HTTP/2. However, these options do not seem very compatible >>> with OpenAPI. We can do some further research in this area, and would really >>> appreciate suggestions from anyone with more insight and experience >>> with OpenAPI. >>> >>> -Jack >>> >>> >>> >>> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com> >>> wrote: >>> >>>> Hi, Rahil and Jack: >>>> >>>> Thanks for raising this. >>>> >>>> My question is that pagination and sharding will make the REST >>>> server stateful, e.g. a sequence of calls is required to go to the same >>>> server. In this case, how do we ensure the scalability of the REST server? >>>> >>>> >>>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> >>>> wrote: >>>> >>>>> Hey Rahil and Jack, >>>>> >>>>> Thanks for bringing this up. Ryan and I also discussed this briefly in >>>>> the early days of PyIceberg, and it would have helped a lot with the speed of >>>>> development. We went with the traditional approach because that would also >>>>> support all the other catalogs, but now that the REST catalog is taking >>>>> off, I think it still makes a lot of sense to get it in. >>>>> >>>>> I do share the concern raised by Ryan around the concepts of shards and >>>>> pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live >>>>> in a single process today, the concept of shards doesn't add value. I >>>>> see your concern with long-running jobs, but for the non-distributed >>>>> cases >>>>> it will add additional complexity. >>>>> >>>>> Some suggestions that come to mind: >>>>> >>>>> - Stream the tasks directly back using a chunked >>>>> response, reducing the latency to the first task. This would also solve >>>>> the pagination problem.
The only downside I can think of is having >>>>> delete files, where you first need to make sure there are deletes >>>>> relevant >>>>> to the task; this might increase the latency to the first task. >>>>> - Make the sharding optional. If you want to shard, you call >>>>> CreateScan first and then call GetScanTask with the IDs. If you >>>>> don't >>>>> want to shard, you omit the shard parameter and fetch the tasks >>>>> directly >>>>> (here we also need to replace the scan string with the full >>>>> column/expression/snapshot-id, etc.). >>>>> >>>>> Looking forward to discussing this tomorrow in the community sync >>>>> <https://iceberg.apache.org/community/#iceberg-community-events>! >>>>> >>>>> Kind regards, >>>>> Fokko >>>>> >>>>> >>>>> >>>>> On Mon, Dec 11, 2023 at 19:05, Jack Ye <yezhao...@gmail.com> wrote: >>>>> >>>>>> Hi Ryan, thanks for the feedback! >>>>>> >>>>>> I was a part of this design discussion internally and can provide >>>>>> more details. One reason for separating the CreateScan operation was to >>>>>> make the API asynchronous and thus keep HTTP communications short. >>>>>> Consider >>>>>> the case where we only have the GetScanTasks API and there is no shard >>>>>> specified. It might take tens of seconds, or even minutes, to read through >>>>>> the manifest list and all the manifests before being able to return anything. >>>>>> This means the HTTP connection has to remain open during that period, >>>>>> which >>>>>> is not really a good practice in general (consider connection failures, >>>>>> load >>>>>> balancer and proxy load, etc.). And when we make the API asynchronous, >>>>>> it basically becomes something like the proposal, where a stateful ID is >>>>>> generated so the service can return to the client immediately, and the >>>>>> client gets results by referencing the ID. So in our current prototype >>>>>> implementation we are actually keeping this ID, and the whole REST service >>>>>> is stateful.
>>>>>> >>>>>> We also had some thoughts about defining a >>>>>> "shard ID generator" protocol: basically, the client agrees with the >>>>>> service >>>>>> on a way to deterministically generate shard IDs, and the service uses them to >>>>>> create shards. That sounds like what you are suggesting here, and it >>>>>> pushes >>>>>> the responsibility for determining the parallelism to the client side. But >>>>>> in >>>>>> some bad cases (e.g. there are many delete files and we need to read all >>>>>> of those in each shard to apply filters), it seems like there might still be >>>>>> the long open connection issue above. What are your thoughts on that? >>>>>> >>>>>> -Jack >>>>>> >>>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote: >>>>>> >>>>>>> Rahil, thanks for working on this. It has some really good ideas >>>>>>> that we hadn't considered before, like a way for the service to plan how >>>>>>> to >>>>>>> break up the work of scan planning. I really like that idea because it >>>>>>> makes it much easier for the service to keep memory consumption low >>>>>>> across >>>>>>> requests. >>>>>>> >>>>>>> My primary feedback is that I think it's a little too complicated >>>>>>> (with both sharding and pagination) and could be modified slightly so >>>>>>> that >>>>>>> the service doesn't need to be stateful. If the service isn't >>>>>>> necessarily >>>>>>> stateful, then it should be easier to build implementations. >>>>>>> >>>>>>> To make it possible for the service to be stateless, I'm proposing >>>>>>> that rather than creating shard IDs that are tracked by the service, the >>>>>>> information for a shard can be sent to the client. My assumption here is >>>>>>> that most implementations would create shards by reading the manifest >>>>>>> list, >>>>>>> filtering on partition ranges, and creating a shard for some reasonable >>>>>>> size of manifest content.
For example, if a table has 100MB of metadata >>>>>>> in >>>>>>> 25 manifests that are about 4 MB each, then it might create 9 shards >>>>>>> with >>>>>>> 1-4 manifests each. The service could send those shards to the client >>>>>>> as a >>>>>>> list of manifests to read and the client could send the shard >>>>>>> information >>>>>>> back to the service to get the data files in each shard (along with the >>>>>>> original filter). >>>>>>> >>>>>>> There's a slight trade-off that the protocol needs to define how to >>>>>>> break the work into shards. I'm interested in hearing if that would work >>>>>>> with how you were planning on building the service on your end. Another >>>>>>> option is to let the service send back arbitrary JSON that would get >>>>>>> returned for each shard. Either way, I like that this would make it so >>>>>>> the >>>>>>> service doesn't need to persist anything. We could also make it so that >>>>>>> small tables don't require multiple requests. For example, a client >>>>>>> could >>>>>>> call the route to get file tasks with just a filter. >>>>>>> >>>>>>> What do you think? >>>>>>> >>>>>>> Ryan >>>>>>> >>>>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil >>>>>>> <rcher...@amazon.com.invalid> wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at >>>>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a new >>>>>>>> Scan >>>>>>>> API that will be utilized by the RESTCatalog. The process for table >>>>>>>> scan >>>>>>>> planning is currently done within client engines such as Apache Spark. >>>>>>>> By >>>>>>>> moving scan functionality to the RESTCatalog, we can integrate Iceberg >>>>>>>> table scans with external services, which can lead to several benefits. >>>>>>>> >>>>>>>> For example, we can leverage caching and indexes on the server side >>>>>>>> to improve planning performance. 
Furthermore, by moving this scan >>>>>>>> logic to >>>>>>>> the RESTCatalog, non-JVM engines can integrate more easily. All of this >>>>>>>> can be >>>>>>>> found in the detailed proposal below. Feel free to comment and add >>>>>>>> your >>>>>>>> suggestions. >>>>>>>> >>>>>>>> Detailed proposal: >>>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h >>>>>>>> >>>>>>>> GitHub POC: https://github.com/apache/iceberg/pull/9252 >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Rahil Chertara >>>>>>>> Amazon EMR & Athena >>>>>>>> rcher...@amazon.com >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ryan Blue >>>>>>> Tabular >>>>>>> >>>>>> >> >> -- >> Ryan Blue >> Tabular >> > -- Ryan Blue Tabular
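One more concrete note on Ryan's stateless sharding example earlier in the thread (100MB of metadata in 25 manifests of about 4MB each, producing 9 shards of 1-4 manifests): the planner can be a pure function of the manifest list, so any node can recompute or serve a shard without persisted state. A rough sketch; the greedy grouping heuristic and the exact task shape are my assumptions for illustration, not part of the proposal:

```python
def plan_shards(manifests, target_bytes):
    """Greedily group (name, size) manifest entries into shard tasks of
    roughly target_bytes each.

    Each task carries the manifests it covers, so the client can send the
    whole task back with the original filter and any server node can
    answer it without per-scan state.
    """
    shards, current, size = [], [], 0
    for name, length in manifests:
        if current and size + length > target_bytes:
            shards.append({"id": len(shards) + 1, "manifests": current})
            current, size = [], 0
        current.append(name)
        size += length
    if current:
        shards.append({"id": len(shards) + 1, "manifests": current})
    return shards


# Ryan's example: ~100MB of metadata in 25 manifests of ~4MB each.
MB = 1024 * 1024
manifests = [(f"m{i}", 4 * MB) for i in range(25)]
shards = plan_shards(manifests, target_bytes=12 * MB)
print(len(shards))  # 9 shards: eight of 3 manifests, one of 1
```

Because the function is deterministic, it doubles as the "shard ID generator" Jack mentioned: the ID is just the position of the group in the plan.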