Re: Proposal for REST APIs for Iceberg table scans

Renjie Liu Wed, 13 Dec 2023 18:37:46 -0800

About the pagination part, I did some investigation and found that openapi
doesn't have spec about streaming responses, but it's actually
implementation detail. There are several ways to implement json streaming
<https://en.wikipedia.org/wiki/JSON_streaming> , and also there is  an rfc
<https://www.rfc-editor.org/rfc/rfc7464.html> for json-seq mime type, but
it seems not supported widely.


In practice kubernetes has used streaming to implement watch api
<https://kubernetes.io/docs/reference/using-api/api-concepts/#streaming-lists>
 .

I think we can choose one of the approaches to avoid pagination. Also it's
worth noticing that the http client needs to support http streaming, but I
think it would not be a block nowadays.

On Thu, Dec 14, 2023 at 8:39 AM Jack Ye <yezhao...@gmail.com> wrote:

> Seems like that track has expired (This Internet-Draft will expire on 13
> May 2022), not sure how these RFCs are managed, but it does not seem
> hopeful to have this verb in. I think people are mostly using POST for this
> use case already.
>
> But overall I think we are in agreement with the general direction. A few
> detail discussions:
>
> *Distinguish planning using shard or not*
> Maybe we should add a query parameter like *distributed=true* to
> distinguish your first and third case, since they are now sharing the same
> signature. If the requester wants to use distributed planning, then some
> sharding strategy is provided as a response for the requester to send more
> detailed requests.
>
> *Necessity of scan ID*
> In this approach, is scan ID still required? Because the shard payload
> already fully describes the information to retrieve, it seems like we can
> just drop the *scan-id* query parameter in the second case. Seems like
> it's kept for the case if we still want to persist some state, but it seems
> like we can make a stateless style fully working.
>
> *Shape of shard payload*
> What do you think is necessary information of the shard payload? It seems
> like we need at least the location of the manifests, plus the delete
> manifests or delete files associated with the manifests. I like the idea of
> making it a "shard task" that is similar to a file scan task, and it might
> allow us to return a mixture of both types of tasks, so we can have better
> control of the response size.
>
> -Jack
>
> On Wed, Dec 13, 2023 at 3:50 PM Ryan Blue <b...@tabular.io> wrote:
>
>> I just changed it to POST after looking into support for the QUERY
>> method. It's a new HTTP method for cases like this where you don't want to
>> pass everything through query params. Here's the QUERY method RFC
>> <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>,
>> but I guess it isn't finalized yet?
>>
>> Just read them like you would a POST request that doesn't actually create
>> anything.
>>
>> On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Thanks, the Gist explains a lot of things. This is actually very close
>>> to our way of implementing the shard ID, we were defining the shard ID as a
>>> string, and the string content is actually something similar to the
>>> information of the JSON payload you showed, so we can persist minimum
>>> information in storage.
>>>
>>> Just one clarification needed for your Gist:
>>>
>>> > QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1
>>> > { "shard": { "id": 1, "manifests": ["C"] }, "filter": {"type": "in",
>>> "term": "x", "values": [1, 2, 3] } }
>>> >
>>> > { "file-scan-tasks": [...] }
>>>
>>> Here, what does this QUERY verb mean? Is that a GET? If it's GET, we
>>> cannot have a request body. That's actually why we expressed that as an ID
>>> string, since we can put it as a query parameter.
>>>
>>> -Jack
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Jack,
>>>>
>>>> It sounds like what I’m proposing isn’t quite clear because your
>>>> initial response was arguing for a sharding capability. I agree that
>>>> sharding is a good idea. I’m less confident about two points:
>>>>
>>>>    1. Requiring that the service is stateful. As Renjie pointed out,
>>>>    that makes it harder to scale the service.
>>>>    2. The need for both pagination *and* sharding as separate things
>>>>
>>>> And I also think that Fokko has a good point about trying to keep
>>>> things simple and not requiring the CreateScan endpoint.
>>>>
>>>> For the first point, I’m proposing that we still have a CreateScan
>>>> endpoint, but instead of sending only a list of shard IDs it can also send
>>>> either a standard shard “task” or an optional JSON definition. Let’s assume
>>>> we can send arbitrary JSON for an example. Say I have a table with 4
>>>> manifests, A through D and that C and D match some query filter. When
>>>> I call the CreateScan endpoint, the service would send back tasks with
>>>> that information: {"id": 1, "manifests": ["C"]}, {"id": 2,
>>>> "manifests": ["D"]}. By sending what the shards mean (the manifests to
>>>> read), my service can be stateless: any node can get a request for shard 1,
>>>> read manifest C, and send back the resulting data files.
>>>>
>>>> I don’t see much of an argument against doing this *in principle*. It
>>>> gives you the flexibility to store state if you choose or to send state to
>>>> the client for it to pass back when calling the GetTasks endpoint.
>>>> There is a practical problem, which is that it’s annoying to send a GET
>>>> request with a JSON payload because you can’t send a request body. It’s
>>>> probably obvious, but I’m also not a REST purist so I’d be fine using POST
>>>> or QUERY for this. It would look something like this Gist
>>>> <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>.
>>>>
>>>> In your last reply, you also asked whether a stateless service is a
>>>> goal. I don’t think that it is, but if we can make simple changes to the
>>>> spec to allow more flexibility on the server side, I think that’s a good
>>>> direction. You also asked about a reference implementation and I consider
>>>> CatalogHandlers
>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java>
>>>> to be that reference. It does everything except for the work done by your
>>>> choice of web application framework. It isn’t stateless, but it only relies
>>>> on a Catalog implementation for persistence.
>>>>
>>>> For the second point, I don’t understand why we need both sharding and
>>>> pagination. That is, if we have a protocol that allows sharding, why is
>>>> pagination also needed? From my naive perspective on how sharding would
>>>> work, we should be able to use metadata from the manifest list to limit the
>>>> potential number of data files in a given shard. As long as we can limit
>>>> the size of a shard to produce more, pagination seems like unnecessary
>>>> complication.
>>>>
>>>> Lastly, for Fokko’s point, I think another easy extension to the
>>>> proposal is to support a direct call to GetTasks. There’s a trade-off
>>>> here, but if you’re already sending the original filter along with the
>>>> request (in order to filter records from manifest C for instance) then
>>>> the request is already something the protocol can express. There’s an
>>>> objection concerning resource consumption on the service and creating
>>>> responses that are too large or take too long, but we can get around that
>>>> by responding with a code that instructs the client to use the
>>>> CreateScan API like 413 (Payload too large). I think that would allow
>>>> simple clients to function for all but really large tables. The gist above
>>>> also shows what this might look like.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> The current proposal definitely makes the server stateful. In our
>>>>> prototype we used other components like DynamoDB to keep track of states.
>>>>> If keeping it stateless is a tenant we can definitely make the proposal
>>>>> closer to that direction. Maybe one thing to make sure is, is this a core
>>>>> tenant of the REST spec? Today we do not even have an official reference
>>>>> implementation of the REST server, I feel it is hard to say what are the
>>>>> core tenants. Maybe we should create one?
>>>>>
>>>>> Pagination is a common issue in the REST spec. We also see similar
>>>>> limitations with other APIs like GetTables, GetNamespaces. When a catalog
>>>>> has many namespaces and tables it suffers from the same issue. It is also
>>>>> not ideal for use cases like web browsers, since typically you display a
>>>>> small page of results and do not need the full list immediately. So I feel
>>>>> we cannot really avoid some state to be kept for those use cases.
>>>>>
>>>>> Chunked response might be a good way to work around it. We also
>>>>> thought about using HTTP2. However, these options seem to be not very
>>>>> compatible with OpenAPI. We can do some further research in this domain,
>>>>> would really appreciate it if anyone has more insights and experience with
>>>>> OpenAPI that can provide some suggestions.
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, Rahi and Jack:
>>>>>>
>>>>>> Thanks for raising this.
>>>>>>
>>>>>> My question is that the pagination and sharding will make the rest
>>>>>> server stateful, e.g. a sequence of calls is required to go to the same
>>>>>> server. In this case, how do we ensure the scalability of the rest 
>>>>>> server?
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Rahil and Jack,
>>>>>>>
>>>>>>> Thanks for bringing this up. Ryan and I also discussed this briefly
>>>>>>> in the early days of PyIceberg and it would have helped a lot in the 
>>>>>>> speed
>>>>>>> of development. We went for the traditional approach because that would
>>>>>>> also support all the other catalogs, but now that the REST catalog is
>>>>>>> taking off, I think it still makes a lot of sense to get it in.
>>>>>>>
>>>>>>> I do share the concern raised Ryan around the concepts of shards and
>>>>>>> pagination. For PyIceberg (but also for Go, Rust, and DuckDB) that are
>>>>>>> living in a single process today the concept of shards doesn't add 
>>>>>>> value. I
>>>>>>> see your concern with long-running jobs, but for the non-distributed 
>>>>>>> cases,
>>>>>>> it will add additional complexity.
>>>>>>>
>>>>>>> Some suggestions that come to mind:
>>>>>>>
>>>>>>>    - Stream the tasks directly back using a chunked
>>>>>>>    response, reducing the latency to the first task. This would also 
>>>>>>> solve
>>>>>>>    things with the pagination. The only downside I can think of is 
>>>>>>> having
>>>>>>>    delete files where you first need to make sure there are deletes 
>>>>>>> relevant
>>>>>>>    to the task, this might increase latency to the first task.
>>>>>>>    - Making the sharding optional. If you want to shard you call
>>>>>>>    the CreateScan first and then call the GetScanTask with the IDs. If 
>>>>>>> you
>>>>>>>    don't want to shard, you omit the shard parameter and fetch the tasks
>>>>>>>    directly (here we need also replace the scan string with the full
>>>>>>>    column/expression/snapshot-id etc).
>>>>>>>
>>>>>>> Looking forward to discussing this tomorrow in the community sync
>>>>>>> <https://iceberg.apache.org/community/#iceberg-community-events>!
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com>:
>>>>>>>
>>>>>>>> Hi Ryan, thanks for the feedback!
>>>>>>>>
>>>>>>>> I was a part of this design discussion internally and can provide
>>>>>>>> more details. One reason for separating the CreateScan operation was to
>>>>>>>> make the API asynchronous and thus keep HTTP communications short. 
>>>>>>>> Consider
>>>>>>>> the case where we only have GetScanTasks API, and there is no shard
>>>>>>>> specified. It might take tens of seconds, or even minutes to read 
>>>>>>>> through
>>>>>>>> all the manifest list and manifests before being able to return 
>>>>>>>> anything.
>>>>>>>> This means the HTTP connection has to remain open during that period, 
>>>>>>>> which
>>>>>>>> is not really a good practice in general (consider connection failure, 
>>>>>>>> load
>>>>>>>> balancer and proxy load, etc.). And when we shift the API to 
>>>>>>>> asynchronous,
>>>>>>>> it basically becomes something like the proposal, where a stateful ID 
>>>>>>>> is
>>>>>>>> generated to be able to immediately return back to the client, and the
>>>>>>>> client get results by referencing the ID. So in our current prototype
>>>>>>>> implementation we are actually keeping this ID and the whole REST 
>>>>>>>> service
>>>>>>>> is stateful.
>>>>>>>>
>>>>>>>> There were some thoughts we had about the possibility to define a
>>>>>>>> "shard ID generator" protocol: basically the client agrees with the 
>>>>>>>> service
>>>>>>>> a way to deterministically generate shard IDs, and service uses it to
>>>>>>>> create shards. That sounds like what you are suggesting here, and it 
>>>>>>>> pushes
>>>>>>>> the responsibility to the client side to determine the parallelism. 
>>>>>>>> But in
>>>>>>>> some bad cases (e.g. there are many delete files and we need to read 
>>>>>>>> all
>>>>>>>> those in each shard to apply filters), it seems like there might still 
>>>>>>>> be
>>>>>>>> the long open connection issue above. What is your thought on that?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Rahil, thanks for working on this. It has some really good ideas
>>>>>>>>> that we hadn't considered before like a way for the service to plan 
>>>>>>>>> how to
>>>>>>>>> break up the work of scan planning. I really like that idea because it
>>>>>>>>> makes it much easier for the service to keep memory consumption low 
>>>>>>>>> across
>>>>>>>>> requests.
>>>>>>>>>
>>>>>>>>> My primary feedback is that I think it's a little too complicated
>>>>>>>>> (with both sharding and pagination) and could be modified slightly so 
>>>>>>>>> that
>>>>>>>>> the service doesn't need to be stateful. If the service isn't 
>>>>>>>>> necessarily
>>>>>>>>> stateful then it should be easier to build implementations.
>>>>>>>>>
>>>>>>>>> To make it possible for the service to be stateless, I'm proposing
>>>>>>>>> that rather than creating shard IDs that are tracked by the service, 
>>>>>>>>> the
>>>>>>>>> information for a shard can be sent to the client. My assumption here 
>>>>>>>>> is
>>>>>>>>> that most implementations would create shards by reading the manifest 
>>>>>>>>> list,
>>>>>>>>> filtering on partition ranges, and creating a shard for some 
>>>>>>>>> reasonable
>>>>>>>>> size of manifest content. For example, if a table has 100MB of 
>>>>>>>>> metadata in
>>>>>>>>> 25 manifests that are about 4 MB each, then it might create 9 shards 
>>>>>>>>> with
>>>>>>>>> 1-4 manifests each. The service could send those shards to the client 
>>>>>>>>> as a
>>>>>>>>> list of manifests to read and the client could send the shard 
>>>>>>>>> information
>>>>>>>>> back to the service to get the data files in each shard (along with 
>>>>>>>>> the
>>>>>>>>> original filter).
>>>>>>>>>
>>>>>>>>> There's a slight trade-off that the protocol needs to define how
>>>>>>>>> to break the work into shards. I'm interested in hearing if that 
>>>>>>>>> would work
>>>>>>>>> with how you were planning on building the service on your end. 
>>>>>>>>> Another
>>>>>>>>> option is to let the service send back arbitrary JSON that would get
>>>>>>>>> returned for each shard. Either way, I like that this would make it 
>>>>>>>>> so the
>>>>>>>>> service doesn't need to persist anything. We could also make it so 
>>>>>>>>> that
>>>>>>>>> small tables don't require multiple requests. For example, a client 
>>>>>>>>> could
>>>>>>>>> call the route to get file tasks with just a filter.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil
>>>>>>>>> <rcher...@amazon.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at
>>>>>>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a 
>>>>>>>>>> new Scan
>>>>>>>>>> API that will be utilized by the RESTCatalog. The process for table 
>>>>>>>>>> scan
>>>>>>>>>> planning is currently done within client engines such as Apache 
>>>>>>>>>> Spark. By
>>>>>>>>>> moving scan functionality to the RESTCatalog, we can integrate 
>>>>>>>>>> Iceberg
>>>>>>>>>> table scans with external services, which can lead to several 
>>>>>>>>>> benefits.
>>>>>>>>>>
>>>>>>>>>> For example, we can leverage caching and indexes on the server
>>>>>>>>>> side to improve planning performance. Furthermore, by moving this 
>>>>>>>>>> scan
>>>>>>>>>> logic to the RESTCatalog, non-JVM engines can integrate more easily. 
>>>>>>>>>> This
>>>>>>>>>> all can be found in the detailed proposal below. Feel free to 
>>>>>>>>>> comment, and
>>>>>>>>>> add your suggestions .
>>>>>>>>>>
>>>>>>>>>> Detailed proposal:
>>>>>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>>>>>>>
>>>>>>>>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Rahil Chertara
>>>>>>>>>> Amazon EMR & Athena
>>>>>>>>>> rcher...@amazon.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: Proposal for REST APIs for Iceberg table scans

Reply via email to