Re: Proposal for REST APIs for Iceberg table scans

Jean-Baptiste Onofré Wed, 19 Jun 2024 06:09:35 -0700

Hi Ryan,

You are right: I can't access the document either. AFAIR, Jack wrote the doc;
I'm sure he will fix that soon.


Regards
JB

On Wed, Jun 19, 2024 at 1:17 AM Ryan Blue <[email protected]> wrote:

> It looks like the design doc from the original email is no longer
> available. Could someone fix the permissions?
>
> On Mon, May 20, 2024 at 8:10 AM Jack Ye <[email protected]> wrote:
>
>> We merged the spec change for content file in
>> https://github.com/apache/iceberg/pull/9717, the next step is to merge
>> the PlanTable and PreplanTable API spec change in
>> https://github.com/apache/iceberg/pull/9695. I guess people were a bit
>> busy in the past few weeks due to the Iceberg summit, you should see more
>> progress pretty soon!
>>
>> Best,
>> Jack Ye
>>
>> On Sat, May 18, 2024 at 4:05 PM Pucheng Yang <[email protected]>
>> wrote:
>>
>>> Hi all, I wonder if we have an ETA for this change? Thanks.
>>>
>>> On Wed, Jan 31, 2024 at 10:30 AM Chertara, Rahil
>>> <[email protected]> wrote:
>>>
>>>> Sure, I can look into adding this to the spec.
>>>> Thanks to everyone for sharing their thoughts, appreciate it!
>>>>
>>>>
>>>>
>>>> *From: *Ryan Blue <[email protected]>
>>>> *Reply-To: *"[email protected]" <[email protected]>
>>>> *Date: *Wednesday, January 31, 2024 at 10:22 AM
>>>> *To: *"[email protected]" <[email protected]>
>>>> *Subject: *RE: [EXTERNAL] Proposal for REST APIs for Iceberg table
>>>> scans
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Looks good to me! Should we get a PR up to add it to the OpenAPI spec?
>>>>
>>>>
>>>>
>>>> On Wed, Jan 31, 2024 at 10:16 AM Jack Ye <[email protected]> wrote:
>>>>
>>>> Sounds good. I don't really have any strong opinions here. So looks
>>>> like we are landing on this?
>>>>
>>>>
>>>>
>>>>
>>>> *PreplanTable: POST /v1/namespaces/ns/tables/t/preplan *{ "filter": {
>>>> "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }
>>>>
>>>> { "plan-tasks": [ { ... },  { ... } ] } // opaque object
>>>>
>>>>
>>>> *PlanTable w/o a plan task: POST /v1/namespaces/ns/tables/t/plan *{
>>>> "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select":
>>>> ["x", "a.b"] }
>>>>
>>>>
>>>> { "file-scan-tasks": [ { ... }, { ... } ] } // FileScanTask OpenAPI
>>>> model
>>>>
>>>>
>>>>
>>>> *PlanTable w/ a plan task: POST /v1/namespaces/ns/tables/t/plan *{
>>>> "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select":
>>>> ["x", "a.b"], "plan-task": { ... } }
>>>>
>>>> { "file-scan-tasks": [ { ... }, { ... } ] }
>>>>
>>>>
>>>>
>>>> -Jack
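
The two-endpoint shape above can be sketched roughly as follows. This is only an illustration under stated assumptions: FakeCatalog is a hypothetical in-memory stand-in (not a real server or Iceberg client), the {"m": ...} token layout and the {"path": ...} file entries are invented for the example, and only the routes and the "filter", "select", "plan-task", "plan-tasks" and "file-scan-tasks" keys come from the thread.

```python
import json

FILTER = {"type": "in", "term": "x", "values": [1, 2, 3]}
SELECT = ["x", "a.b"]


def preplan_request(filter_expr, select):
    """Body for POST /v1/namespaces/ns/tables/t/preplan."""
    return {"filter": filter_expr, "select": select}


def plan_request(filter_expr, select, plan_task=None):
    """Body for POST /v1/namespaces/ns/tables/t/plan.

    The plan task is opaque to the client and is echoed back unchanged.
    """
    body = {"filter": filter_expr, "select": select}
    if plan_task is not None:
        body["plan-task"] = plan_task
    return body


class FakeCatalog:
    """Hypothetical in-memory stand-in for a REST catalog."""

    def __init__(self, files_by_manifest):
        self.files_by_manifest = files_by_manifest

    def post(self, path, body):
        if path.endswith("/preplan"):
            # The token layout is private to this fake server (assumption).
            names = sorted(self.files_by_manifest)
            return {"plan-tasks": [{"m": name} for name in names]}
        if path.endswith("/plan"):
            task = body.get("plan-task")
            names = [task["m"]] if task else sorted(self.files_by_manifest)
            files = [f for n in names for f in self.files_by_manifest[n]]
            # Placeholder entries, not the real FileScanTask model.
            return {"file-scan-tasks": [{"path": f} for f in files]}
        raise ValueError("unknown route: " + path)


def plan_distributed(catalog, table_path, filter_expr, select):
    """Call preplan once, then plan once per returned plan task."""
    response = catalog.post(table_path + "/preplan",
                            preplan_request(filter_expr, select))
    results = []
    for task in response["plan-tasks"]:
        # Round-trip through JSON: the client never inspects the token.
        echoed = json.loads(json.dumps(task))
        reply = catalog.post(table_path + "/plan",
                             plan_request(filter_expr, select, echoed))
        results.extend(reply["file-scan-tasks"])
    return results


catalog = FakeCatalog({"C": ["c1.parquet", "c2.parquet"], "D": ["d1.parquet"]})
base = "/v1/namespaces/ns/tables/t"
direct = catalog.post(base + "/plan", plan_request(FILTER, SELECT))["file-scan-tasks"]
distributed = plan_distributed(catalog, base, FILTER, SELECT)
```

In this sketch a single-process client would use only the direct call, while a distributed engine could hand each plan task to a separate worker.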
>>>>
>>>>
>>>>
>>>> On Wed, Jan 31, 2024 at 10:08 AM Ryan Blue <[email protected]> wrote:
>>>>
>>>> I agree with Dan. I'd rather have two endpoints instead of needing an
>>>> option that changes the behavior entirely in the same route. I don't think
>>>> that a `preplan` route would be too bad.
>>>>
>>>>
>>>>
>>>> On Wed, Jan 31, 2024 at 9:51 AM Daniel Weeks <[email protected]> wrote:
>>>>
>>>> I agree with the opaque tokens.
>>>>
>>>>
>>>>
>>>> However, I'm concerned we're overloading the endpoint to perform two
>>>> distinctly different operations: distribute a plan and scan a plan.
>>>>
>>>>
>>>>
>>>> Changing the task-type then changes the behavior and the result.  I
>>>> feel it would be more straightforward to separate the distribute and scan
>>>> endpoints.  Then clients can call the scan directly if they do not know how
>>>> to distribute and the behavior is clear from the REST Specification.
>>>>
>>>>
>>>>
>>>> -Dan
>>>>
>>>>
>>>>
>>>> On Tue, Jan 30, 2024 at 9:09 PM Jack Ye <[email protected]> wrote:
>>>>
>>>> +1 for having the opaque plan tasks, that's probably the most flexible
>>>> way forward. And let's call them *plan tasks* going forward to
>>>> standardize the terminology.
>>>>
>>>>
>>>>
>>>> I think the name of the APIs can be determined based on the actual API
>>>> shape. For example, if we centralize these 2 plan and pre-plan actions to a
>>>> single API endpoint but just requesting different task types:
>>>>
>>>>
>>>>
>>>>
>>>> *pre-plan: POST /v1/namespaces/ns/tables/t/plan *{ "filter": { "type":
>>>> "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"],
>>>> "task-type": "plan-task" }
>>>>
>>>> { "plan-tasks": [ { ... },  { ... } ] }
>>>>
>>>>
>>>> *plan without a plan-task: POST /v1/namespaces/ns/tables/t/plan *{
>>>> "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select":
>>>> ["x", "a.b"], "task-type": "file-scan-task" } // file-scan-task should be
>>>> the default type
>>>>
>>>>
>>>> { "file-scan-tasks": [ { ... }, { ... } ] }
>>>>
>>>>
>>>>
>>>> *plan with a plan-task: POST /v1/namespaces/ns/tables/t/plan *{
>>>> "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select":
>>>> ["x", "a.b"], "task-type": "file-scan-task", "plan-task": { ... } }
>>>>
>>>> { "file-scan-tasks": [...] }
>>>>
>>>>
>>>>
>>>> In this model, we just have a single API, and we can call it something
>>>> like PlanTable or PlanTableScan.
>>>>
>>>>
>>>>
>>>> What do you think?
>>>>
>>>>
>>>>
>>>> -Jack
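
A minimal sketch of the single-endpoint variant described above, where one handler switches on "task-type". The handler name, the manifests mapping and the {"manifest": ...} token are hypothetical; only the request and response keys and the default of "file-scan-task" are taken from the message.

```python
def handle_plan(body, manifests):
    """Hypothetical handler for POST /v1/namespaces/ns/tables/t/plan.

    `manifests` maps a manifest name to the data file paths it lists.
    """
    # file-scan-task is the default type, as suggested in the message.
    task_type = body.get("task-type", "file-scan-task")
    if task_type == "plan-task":
        # Pre-plan: return opaque plan tasks (layout invented for this sketch).
        return {"plan-tasks": [{"manifest": name} for name in sorted(manifests)]}
    if task_type == "file-scan-task":
        plan_task = body.get("plan-task")
        names = [plan_task["manifest"]] if plan_task else sorted(manifests)
        paths = [p for n in names for p in manifests[n]]
        return {"file-scan-tasks": [{"path": p} for p in paths]}
    raise ValueError(f"unsupported task-type: {task_type!r}")


manifests = {"C": ["c1.parquet"], "D": ["d1.parquet", "d2.parquet"]}
request = {"filter": {"type": "in", "term": "x", "values": [1, 2, 3]},
           "select": ["x", "a.b"]}

pre = handle_plan({**request, "task-type": "plan-task"}, manifests)
full = handle_plan(request, manifests)
one = handle_plan({**request, "plan-task": pre["plan-tasks"][1]}, manifests)
```

Note that the same route returns differently shaped responses depending on a request field, which is the behavior the replies to this message go on to discuss.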
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jan 29, 2024 at 6:17 PM Renjie Liu <[email protected]>
>>>> wrote:
>>>>
>>>> > But to move forward, I think we should go with the option that
>>>> preserves flexibility. I think the spec should state that plan tasks (if we
>>>> call them that) are a JSON object that should be sent as-is back to the
>>>> REST service to be used.
>>>>
>>>>
>>>>
>>>> +1 for this.
>>>>
>>>>
>>>>
>>>> > One more thing that I would also change is that I don't think the
>>>> "plan" and "scan" endpoints make much sense. We refer to the "scan" portion
>>>> of this as "planFiles" in the reference implementation, and "scan" is used
>>>> for actually reading data. To be less confusing, I think that file scan
>>>> tasks should be returned by a "plan" endpoint and the manifest plan tasks
>>>> (or shards) should be returned by a "pre-plan" endpoint. Does anyone else
>>>> like the names "pre-plan" and "plan" better?
>>>>
>>>>
>>>>
>>>> I agree that "scan" may be quite confusing since it's actually planning
>>>> file scan. Another option I can provide is: "plan" -> "plan-table-scan",
>>>> "scan" -> "plan-file-scan"
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jan 30, 2024 at 9:03 AM Ryan Blue <[email protected]> wrote:
>>>>
>>>> As you noted the main point we still need to decide on is whether to
>>>> have a standard "shard" definition (e.g. manifest plan task) or to allow it
>>>> to be opaque and specific to catalogs implementing the protocol. I've not
>>>> replied because I keep coming back to this decision and I'm not sure
>>>> whether the advantage is being clear about how it works (being explicit) or
>>>> allowing implementations to differ (opaque). I'm skeptical that there will
>>>> be other strategies.
>>>>
>>>>
>>>>
>>>> But to move forward, I think we should go with the option that
>>>> preserves flexibility. I think the spec should state that plan tasks (if we
>>>> call them that) are a JSON object that should be sent as-is back to the
>>>> REST service to be used.
>>>>
>>>>
>>>>
>>>> One more thing that I would also change is that I don't think the
>>>> "plan" and "scan" endpoints make much sense. We refer to the "scan" portion
>>>> of this as "planFiles" in the reference implementation, and "scan" is used
>>>> for actually reading data. To be less confusing, I think that file scan
>>>> tasks should be returned by a "plan" endpoint and the manifest plan tasks
>>>> (or shards) should be returned by a "pre-plan" endpoint. Does anyone else
>>>> like the names "pre-plan" and "plan" better?
>>>>
>>>>
>>>>
>>>> Ryan
>>>>
>>>>
>>>>
>>>> On Mon, Jan 29, 2024 at 12:02 PM Chertara, Rahil
>>>> <[email protected]> wrote:
>>>>
>>>> Hi all, hope everyone is doing well.
>>>>
>>>>
>>>> Wanted to revive the discussion around the REST Table Scan API work.
>>>> For a refresher here is the original proposal:
>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>> as well as the PR: https://github.com/apache/iceberg/pull/9252
>>>>
>>>>
>>>> From the last messages on the thread, I believe Ryan and Jack were in
>>>> favor of having two distinct API endpoints, /plan and /scan, as well as a
>>>> stricter JSON definition for the "shard". Here is an example from
>>>> what was discussed.
>>>>
>>>>
>>>> *POST /v1/namespaces/ns/tables/t/plan *{ "filter": { "type": "in",
>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>
>>>> { "manifest-plan-tasks": [
>>>>   { "start": 0, "length": 1000, "manifest": { "path":
>>>> "s3://some/manifest.avro", ...}, "delete-manifests": [...] },
>>>>   { ... }
>>>> ]}
>>>>
>>>>
>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in",
>>>> "term": "x", "values": [1, 2, 3] },
>>>>
>>>>   "select": ["x", "a.b"],
>>>>   "manifest-plan-task": { "start": 0, "length": 1000, "manifest": {
>>>> "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] } }
>>>>
>>>> { "file-scan-tasks": [...] }
>>>>
>>>>
>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in",
>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>
>>>>
>>>> { "file-scan-tasks": [...] }
>>>>
>>>>
>>>>
>>>> However IIRC Micah and Renjie had some concerns around this stricter
>>>> structure as this can make it harder to evolve in the future, as well as
>>>> some potential scalability challenges for larger tables that have many
>>>> manifest files. (Feel free to expand further on the concerns if my
>>>> understanding is incorrect).
>>>>
>>>>
>>>>
>>>> Would appreciate if the community can leave any more thoughts/feedback
>>>> on this thread, as well as on the google doc, and the PR.
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Rahil Chertara
>>>>
>>>>
>>>>
>>>> *From: *Renjie Liu <[email protected]>
>>>> *Reply-To: *"[email protected]" <[email protected]>
>>>> *Date: *Thursday, December 21, 2023 at 10:35 PM
>>>> *To: *"[email protected]" <[email protected]>
>>>> *Subject: *RE: [EXTERNAL] Proposal for REST APIs for Iceberg table
>>>> scans
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I share the same concern with Micah. The shard detail should be
>>>> implementation details of the server, rather than exposing directly to the
>>>> client. If the goal is to make things stateless, we just need to attach a
>>>> snapshot id + shard id, then a deterministic algorithm is supposed to give the
>>>> same result. Also another concern is for huge analytics tables, we may have
>>>> a lot of manifest files, which may lead to large traffic from the rest
>>>> server.
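
The "snapshot id + shard id" idea above could look roughly like this. It is a sketch under assumptions: SNAPSHOT_MANIFESTS is a hypothetical lookup table, the round-robin split is just one possible deterministic rule, and a shard count is added because the split needs it (the message itself mentions only the two ids).

```python
# Hypothetical catalog state: snapshot id -> manifest paths in that snapshot.
SNAPSHOT_MANIFESTS = {
    42: ["s3://bucket/m3.avro", "s3://bucket/m1.avro", "s3://bucket/m2.avro"],
}


def manifests_for_shard(snapshot_id, shard_id, shard_count):
    """Deterministically pick the manifests that belong to one shard.

    Sorting first makes the result independent of listing order, so any
    server node computes the same answer for the same snapshot.
    """
    if not 0 <= shard_id < shard_count:
        raise ValueError("shard_id must be in [0, shard_count)")
    ordered = sorted(SNAPSHOT_MANIFESTS[snapshot_id])
    # Round-robin assignment: every manifest lands in exactly one shard.
    return ordered[shard_id::shard_count]


shard_0 = manifests_for_shard(42, 0, 2)
shard_1 = manifests_for_shard(42, 1, 2)
```

With this rule the client only carries small integers, and the manifest list itself never has to travel over the wire.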
>>>>
>>>>
>>>>
>>>> On Thu, Dec 21, 2023 at 7:41 AM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>> > Also +1 for having a more strict definition of the shard. Having
>>>> arbitrary JSON was basically what we experimented with a string shard ID,
>>>> and we ended up with something very similar to the manifest plan task you
>>>> describe in the serialized ID string.
>>>>
>>>>
>>>>
>>>> If I understand the proposal correctly, I'd actually be -0.0 on the stricter
>>>> structure.  I think forcing a contract where it isn't strictly necessary
>>>> makes it harder to evolve the system in the future.  For example it makes
>>>> it harder to address potential scalability problems in a transparent way
>>>> (e.g. extreme edge cases in cardinality between manifest files and delete
>>>> files).
>>>>
>>>>
>>>>
>>>> It also seems like it might overly constrain implementations (it is not
>>>> clear we should need to compute the mapping between delete file manifests
>>>> to data file manifests up front to start planning).
>>>>
>>>>
>>>>
>>>> On Tue, Dec 19, 2023 at 2:10 PM Jack Ye <[email protected]> wrote:
>>>>
>>>> +1 for having /plan and /scan, sounds like a good idea to separate
>>>> those 2 distinct actions.
>>>>
>>>>
>>>>
>>>> Also +1 for having a more strict definition of the shard. Having
>>>> arbitrary JSON was basically what we experimented with a string shard ID,
>>>> and we ended up with something very similar to the manifest plan task you
>>>> describe in the serialized ID string.
>>>>
>>>>
>>>>
>>>> So sounds like we are converging to the following APIs:
>>>>
>>>>
>>>>
>>>>
>>>> *POST /v1/namespaces/ns/tables/t/plan *{ "filter": { "type": "in",
>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>
>>>> { "manifest-plan-tasks": [
>>>>   { "start": 0, "length": 1000, "manifest": { "path":
>>>> "s3://some/manifest.avro", ...}, "delete-manifests": [...] },
>>>>   { ... }
>>>> ]}
>>>>
>>>>
>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in",
>>>> "term": "x", "values": [1, 2, 3] },
>>>>
>>>>   "select": ["x", "a.b"],
>>>>   "manifest-plan-task": { "start": 0, "length": 1000, "manifest": {
>>>> "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] } }
>>>>
>>>> { "file-scan-tasks": [...] }
>>>>
>>>>
>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in",
>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>
>>>>
>>>> { "file-scan-tasks": [...] }
>>>>
>>>>
>>>>
>>>> If this sounds good overall, we can update the prototype to have more
>>>> detailed discussions in code.
>>>>
>>>>
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Thu, Dec 14, 2023 at 6:10 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>> The tasks might look something like this:
>>>>
>>>>
>>>>
>>>> CombinedPlanTask
>>>>
>>>> - List<ManifestPlanTask>
>>>>
>>>>
>>>>
>>>> ManifestPlanTask
>>>>
>>>> - int start
>>>>
>>>> - int length
>>>>
>>>> - ManifestFile dataManifest
>>>>
>>>> - List<ManifestFile> deleteManifests
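
The task layout above maps naturally onto plain data classes. The sketch below is an assumption-laden illustration: ManifestFile carries only a path because the other fields are elided in the thread, and to_json guesses that delete manifests serialize the same way as the data manifest. The JSON key names follow the examples elsewhere in the thread.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ManifestFile:
    path: str  # other manifest fields are elided in the thread


@dataclass(frozen=True)
class ManifestPlanTask:
    start: int
    length: int
    data_manifest: ManifestFile
    delete_manifests: List[ManifestFile] = field(default_factory=list)

    def to_json(self):
        """Serialize using the key names from the thread's JSON examples."""
        return {
            "start": self.start,
            "length": self.length,
            "manifest": {"path": self.data_manifest.path},
            # Assumed shape: the thread shows only "[...]" for this list.
            "delete-manifests": [{"path": m.path} for m in self.delete_manifests],
        }


@dataclass(frozen=True)
class CombinedPlanTask:
    tasks: List[ManifestPlanTask]


task = ManifestPlanTask(0, 1000, ManifestFile("s3://some/manifest.avro"))
combined = CombinedPlanTask([task])
```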
>>>>
>>>>
>>>>
>>>> On Thu, Dec 14, 2023 at 4:07 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>> > Seems like that track has expired (This Internet-Draft will expire on
>>>> 13 May 2022)
>>>>
>>>> Yeah, looks like we should just use POST. That’s too bad. QUERY seems
>>>> like a good idea to me.
>>>>
>>>> *Distinguish planning using shard or not*
>>>>
>>>> I think this was a mistake on my part. I was still thinking that we
>>>> would have a different endpoint for first-level planning to produce shards
>>>> and the route to actually get files. Since both are POST requests with the
>>>> same path (/v1/namespaces/ns/tables/t/scans) that no longer works.
>>>> What about /v1/namespaces/ns/tables/t/scan and
>>>> /v1/namespaces/ns/tables/t/plan? The latter could use some variant of
>>>> planFiles since that’s what we are wrapping in the Java API.
>>>>
>>>> *Necessity of scan ID*
>>>>
>>>> Yes, I agree. If you have shard IDs then you don’t really need a scan
>>>> ID. You could always have one internally but send it as part of the shard
>>>> ID.
>>>>
>>>> *Shape of shard payload*
>>>>
>>>> I think we have 2 general options depending on how strict we want to be.
>>>>
>>>>    1. Require a standard shard definition
>>>>    2. Allow arbitrary JSON and leave it to the service
>>>>
>>>> I lean toward the first option, which would be a data manifest and the
>>>> associated delete manifests for the partition. We could also extend that to
>>>> a group of manifests, each with a list of delete manifests. And we could
>>>> also allow splitting to ensure tasks don’t get too large with big files.
>>>> This all looks basically like FileScanTask, but with manifests and delete
>>>> manifests.
>>>>
>>>>
>>>>
>>>> On Wed, Dec 13, 2023 at 4:39 PM Jack Ye <[email protected]> wrote:
>>>>
>>>> Seems like that track has expired (This Internet-Draft will expire on
>>>> 13 May 2022), not sure how these RFCs are managed, but it does not seem
>>>> hopeful to have this verb in. I think people are mostly using POST for this
>>>> use case already.
>>>>
>>>>
>>>>
>>>> But overall I think we are in agreement with the general direction. A
>>>> few detail discussions:
>>>>
>>>>
>>>>
>>>> *Distinguish planning using shard or not*
>>>>
>>>> Maybe we should add a query parameter like *distributed=true* to
>>>> distinguish your first and third case, since they are now sharing the same
>>>> signature. If the requester wants to use distributed planning, then some
>>>> sharding strategy is provided as a response for the requester to send more
>>>> detailed requests.
>>>>
>>>> *Necessity of scan ID*
>>>> In this approach, is scan ID still required? Because the shard payload
>>>> already fully describes the information to retrieve, it seems like we can
>>>> just drop the *scan-id* query parameter in the second case. Seems like
>>>> it's kept for the case if we still want to persist some state, but it seems
>>>> like we can make a stateless style fully working.
>>>>
>>>> *Shape of shard payload*
>>>> What do you think is necessary information of the shard payload? It
>>>> seems like we need at least the location of the manifests, plus the delete
>>>> manifests or delete files associated with the manifests. I like the idea of
>>>> making it a "shard task" that is similar to a file scan task, and it might
>>>> allow us to return a mixture of both types of tasks, so we can have better
>>>> control of the response size.
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Wed, Dec 13, 2023 at 3:50 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>> I just changed it to POST after looking into support for the QUERY
>>>> method. It's a new HTTP method for cases like this where you don't want to
>>>> pass everything through query params. Here's the QUERY method RFC
>>>> <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>,
>>>> but I guess it isn't finalized yet?
>>>>
>>>>
>>>>
>>>> Just read them like you would a POST request that doesn't actually
>>>> create anything.
>>>>
>>>>
>>>>
>>>> On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <[email protected]> wrote:
>>>>
>>>> Thanks, the Gist explains a lot of things. This is actually very close
>>>> to our way of implementing the shard ID, we were defining the shard ID as a
>>>> string, and the string content is actually something similar to the
>>>> information of the JSON payload you showed, so we can persist minimum
>>>> information in storage.
>>>>
>>>>
>>>>
>>>> Just one clarification needed for your Gist:
>>>>
>>>>
>>>>
>>>> > QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1
>>>>
>>>>
>>>>
>>>> > { "shard": { "id": 1, "manifests": ["C"] }, "filter": {"type": "in",
>>>> "term": "x", "values": [1, 2, 3] } }
>>>>
>>>>
>>>>
>>>> >
>>>>
>>>> > { "file-scan-tasks": [...] }
>>>>
>>>>
>>>>
>>>> Here, what does this QUERY verb mean? Is that a GET? If it's GET, we
>>>> cannot have a request body. That's actually why we expressed that as an ID
>>>> string, since we can put it as a query parameter.
>>>>
>>>>
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>> Jack,
>>>>
>>>> It sounds like what I’m proposing isn’t quite clear because your
>>>> initial response was arguing for a sharding capability. I agree that
>>>> sharding is a good idea. I’m less confident about two points:
>>>>
>>>>    1. Requiring that the service is stateful. As Renjie pointed out,
>>>>    that makes it harder to scale the service.
>>>>    2. The need for both pagination *and* sharding as separate things
>>>>
>>>> And I also think that Fokko has a good point about trying to keep
>>>> things simple and not requiring the CreateScan endpoint.
>>>>
>>>> For the first point, I’m proposing that we still have a CreateScan
>>>> endpoint, but instead of sending only a list of shard IDs it can also send
>>>> either a standard shard “task” or an optional JSON definition. Let’s assume
>>>> we can send arbitrary JSON for an example. Say I have a table with 4
>>>> manifests, A through D and that C and D match some query filter. When
>>>> I call the CreateScan endpoint, the service would send back tasks with
>>>> that information: {"id": 1, "manifests": ["C"]}, {"id": 2,
>>>> "manifests": ["D"]}. By sending what the shards mean (the manifests to
>>>> read), my service can be stateless: any node can get a request for shard 1,
>>>> read manifest C, and send back the resulting data files.
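
The A-through-D example above can be walked through in a few lines. Everything here is a hypothetical sketch: MANIFESTS is invented sample data (file path plus min and max of column x), and create_scan and get_tasks are plain functions standing in for the CreateScan and GetTasks endpoints.

```python
# Invented sample data: manifest name -> (file path, min x, max x) entries.
MANIFESTS = {
    "A": [("a1.parquet", 10, 20)],
    "B": [("b1.parquet", 21, 30)],
    "C": [("c1.parquet", 1, 1), ("c2.parquet", 7, 9)],
    "D": [("d1.parquet", 2, 5)],
}


def _matches(values, lo, hi):
    """True if any value of the IN filter falls inside [lo, hi]."""
    return any(lo <= v <= hi for v in values)


def create_scan(scan_filter):
    """Return shards that spell out which manifests they cover."""
    values = scan_filter["values"]
    matching = [
        name
        for name, files in sorted(MANIFESTS.items())
        if any(_matches(values, lo, hi) for _, lo, hi in files)
    ]
    return [{"id": i, "manifests": [name]} for i, name in enumerate(matching, start=1)]


def get_tasks(shard, scan_filter):
    """Stateless: needs only the shard payload and the original filter."""
    values = scan_filter["values"]
    return [
        path
        for name in shard["manifests"]
        for path, lo, hi in MANIFESTS[name]
        if _matches(values, lo, hi)
    ]


scan_filter = {"type": "in", "term": "x", "values": [1, 2, 3]}
shards = create_scan(scan_filter)
files = [get_tasks(shard, scan_filter) for shard in shards]
```

Because get_tasks reads nothing but its arguments and the table metadata, any node could serve any shard request.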
>>>>
>>>> I don’t see much of an argument against doing this *in principle*. It
>>>> gives you the flexibility to store state if you choose or to send state to
>>>> the client for it to pass back when calling the GetTasks endpoint.
>>>> There is a practical problem, which is that it’s annoying to send a GET
>>>> request with a JSON payload because you can’t send a request body. It’s
>>>> probably obvious, but I’m also not a REST purist so I’d be fine using POST
>>>> or QUERY for this. It would look something like this Gist
>>>> <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>.
>>>>
>>>> In your last reply, you also asked whether a stateless service is a
>>>> goal. I don’t think that it is, but if we can make simple changes to the
>>>> spec to allow more flexibility on the server side, I think that’s a good
>>>> direction. You also asked about a reference implementation and I consider
>>>> CatalogHandlers
>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java>
>>>> to be that reference. It does everything except for the work done by your
>>>> choice of web application framework. It isn’t stateless, but it only relies
>>>> on a Catalog implementation for persistence.
>>>>
>>>> For the second point, I don’t understand why we need both sharding and
>>>> pagination. That is, if we have a protocol that allows sharding, why is
>>>> pagination also needed? From my naive perspective on how sharding would
>>>> work, we should be able to use metadata from the manifest list to limit the
>>>> potential number of data files in a given shard. As long as we can limit
>>>> the size of a shard to produce more, pagination seems like unnecessary
>>>> complication.
>>>>
>>>> Lastly, for Fokko’s point, I think another easy extension to the
>>>> proposal is to support a direct call to GetTasks. There’s a trade-off
>>>> here, but if you’re already sending the original filter along with the
>>>> request (in order to filter records from manifest C for instance) then
>>>> the request is already something the protocol can express. There’s an
>>>> objection concerning resource consumption on the service and creating
>>>> responses that are too large or take too long, but we can get around that
>>>> by responding with a code that instructs the client to use the
>>>> CreateScan API like 413 (Payload too large). I think that would allow
>>>> simple clients to function for all but really large tables. The gist above
>>>> also shows what this might look like.
>>>>
>>>> Ryan
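
The 413 fallback described above might be handled on the client roughly like this. It is a sketch only: FakeService, PayloadTooLarge and the max_files limit are hypothetical stand-ins for a real HTTP service and its 413 response.

```python
class PayloadTooLarge(Exception):
    """Stands in for an HTTP 413 response in this sketch."""
    status = 413


class FakeService:
    """Hypothetical service that refuses oversized direct requests."""

    def __init__(self, manifests, max_files):
        self.manifests = manifests
        self.max_files = max_files

    def create_scan(self, scan_filter):
        names = sorted(self.manifests)
        return [{"id": i, "manifests": [n]} for i, n in enumerate(names, start=1)]

    def get_tasks(self, scan_filter, shard=None):
        names = shard["manifests"] if shard else sorted(self.manifests)
        files = [f for n in names for f in self.manifests[n]]
        # Only the direct (unsharded) call is size-limited here.
        if shard is None and len(files) > self.max_files:
            raise PayloadTooLarge()
        return files


def plan(service, scan_filter):
    """Try the simple direct call first; fall back to shards on 413."""
    try:
        return service.get_tasks(scan_filter)
    except PayloadTooLarge:
        files = []
        for shard in service.create_scan(scan_filter):
            files.extend(service.get_tasks(scan_filter, shard))
        return files


scan_filter = {"type": "in", "term": "x", "values": [1, 2, 3]}
table = {"C": ["c1.parquet", "c2.parquet"], "D": ["d1.parquet"]}
small = plan(FakeService(table, max_files=10), scan_filter)
large = plan(FakeService(table, max_files=2), scan_filter)
```

A simple client could skip the fallback branch entirely and would still work for every table the service is willing to plan in one response.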
>>>>
>>>>
>>>>
>>>> On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <[email protected]> wrote:
>>>>
>>>> The current proposal definitely makes the server stateful. In our
>>>> prototype we used other components like DynamoDB to keep track of states.
>>>> If keeping it stateless is a tenet we can definitely make the proposal
>>>> closer to that direction. Maybe one thing to make sure is, is this a core
>>>> tenet of the REST spec? Today we do not even have an official reference
>>>> implementation of the REST server, so I feel it is hard to say what the
>>>> core tenets are. Maybe we should create one?
>>>>
>>>>
>>>>
>>>> Pagination is a common issue in the REST spec. We also see similar
>>>> limitations with other APIs like GetTables, GetNamespaces. When a catalog
>>>> has many namespaces and tables it suffers from the same issue. It is also
>>>> not ideal for use cases like web browsers, since typically you display a
>>>> small page of results and do not need the full list immediately. So I feel
>>>> we cannot really avoid some state to be kept for those use cases.
>>>>
>>>>
>>>>
>>>> Chunked response might be a good way to work around it. We also thought
>>>> about using HTTP2. However, these options seem to be not very compatible
>>>> with OpenAPI. We can do some further research in this domain, would really
>>>> appreciate it if anyone has more insights and experience with OpenAPI that
>>>> can provide some suggestions.
>>>>
>>>>
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <[email protected]>
>>>> wrote:
>>>>
>>>> Hi, Rahil and Jack:
>>>>
>>>> Thanks for raising this.
>>>>
>>>>
>>>>
>>>> My question is that the pagination and sharding will make the REST
>>>> server stateful, e.g. a sequence of calls is required to go to the same
>>>> server. In this case, how do we ensure the scalability of the REST server?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <[email protected]>
>>>> wrote:
>>>>
>>>> Hey Rahil and Jack,
>>>>
>>>>
>>>>
>>>> Thanks for bringing this up. Ryan and I also discussed this briefly in
>>>> the early days of PyIceberg and it would have helped a lot in the speed of
>>>> development. We went for the traditional approach because that would also
>>>> support all the other catalogs, but now that the REST catalog is taking
>>>> off, I think it still makes a lot of sense to get it in.
>>>>
>>>>
>>>>
>>>> I do share the concern raised by Ryan around the concepts of shards and
>>>> pagination. For PyIceberg (but also for Go, Rust, and DuckDB) that are
>>>> living in a single process today the concept of shards doesn't add value. I
>>>> see your concern with long-running jobs, but for the non-distributed cases,
>>>> it will add additional complexity.
>>>>
>>>>
>>>>
>>>> Some suggestions that come to mind:
>>>>
>>>>    - Stream the tasks directly back using a chunked response, reducing
>>>>    the latency to the first task. This would also solve things with the
>>>>    pagination. The only downside I can think of is having delete files
>>>>    where you first need to make sure there are deletes relevant to the
>>>>    task; this might increase latency to the first task.
>>>>    - Making the sharding optional. If you want to shard you call the
>>>>    CreateScan first and then call the GetScanTask with the IDs. If you
>>>>    don't want to shard, you omit the shard parameter and fetch the tasks
>>>>    directly (here we also need to replace the scan string with the full
>>>>    column/expression/snapshot-id etc).
>>>>
>>>> Looking forward to discussing this tomorrow in the community sync
>>>> <https://iceberg.apache.org/community/#iceberg-community-events>!
>>>>
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Fokko
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <[email protected]>:
>>>>
>>>> Hi Ryan, thanks for the feedback!
>>>>
>>>>
>>>>
>>>> I was a part of this design discussion internally and can provide more
>>>> details. One reason for separating the CreateScan operation was to make the
>>>> API asynchronous and thus keep HTTP communications short. Consider the case
>>>> where we only have GetScanTasks API, and there is no shard specified. It
>>>> might take tens of seconds, or even minutes to read through all the
>>>> manifest list and manifests before being able to return anything. This
>>>> means the HTTP connection has to remain open during that period, which is
>>>> not really a good practice in general (consider connection failure, load
>>>> balancer and proxy load, etc.). And when we shift the API to asynchronous,
>>>> it basically becomes something like the proposal, where a stateful ID is
>>>> generated to be able to immediately return back to the client, and the
>>>> client gets results by referencing the ID. So in our current prototype
>>>> implementation we are actually keeping this ID and the whole REST service
>>>> is stateful.
>>>>
>>>>
>>>>
>>>> There were some thoughts we had about the possibility to define a
>>>> "shard ID generator" protocol: basically the client agrees with the service
>>>> a way to deterministically generate shard IDs, and service uses it to
>>>> create shards. That sounds like what you are suggesting here, and it pushes
>>>> the responsibility to the client side to determine the parallelism. But in
>>>> some bad cases (e.g. there are many delete files and we need to read all
>>>> those in each shard to apply filters), it seems like there might still be
>>>> the long open connection issue above. What is your thought on that?
>>>>
>>>>
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <[email protected]> wrote:
>>>>
>>>> Rahil, thanks for working on this. It has some really good ideas that
>>>> we hadn't considered before like a way for the service to plan how to break
>>>> up the work of scan planning. I really like that idea because it makes it
>>>> much easier for the service to keep memory consumption low across requests.
>>>>
>>>>
>>>>
>>>> My primary feedback is that I think it's a little too complicated (with
>>>> both sharding and pagination) and could be modified slightly so that the
>>>> service doesn't need to be stateful. If the service isn't necessarily
>>>> stateful then it should be easier to build implementations.
>>>>
>>>>
>>>>
>>>> To make it possible for the service to be stateless, I'm proposing that
>>>> rather than creating shard IDs that are tracked by the service, the
>>>> information for a shard can be sent to the client. My assumption here is
>>>> that most implementations would create shards by reading the manifest list,
>>>> filtering on partition ranges, and creating a shard for some reasonable
>>>> size of manifest content. For example, if a table has 100 MB of metadata in
>>>> 25 manifests that are about 4 MB each, then it might create 9 shards with
>>>> 1-4 manifests each. The service could send those shards to the client as a
>>>> list of manifests to read and the client could send the shard information
>>>> back to the service to get the data files in each shard (along with the
>>>> original filter).
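
The sizing example above (25 manifests of about 4 MB each, grouped into 9 shards) can be reproduced with a simple greedy packer. This is a sketch: the 12 MB target per shard and the manifest names are assumptions chosen to match the numbers in the message, not values from the proposal.

```python
def pack_manifests(manifest_sizes_mb, target_shard_mb):
    """Greedily group manifests into shards of at most target_shard_mb.

    A manifest larger than the target still gets a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for name, size in manifest_sizes_mb:
        if current and current_size + size > target_shard_mb:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


# 25 manifests of 4 MB each: 100 MB of metadata in total.
manifests = [(f"m{i:02d}.avro", 4) for i in range(25)]
shards = pack_manifests(manifests, target_shard_mb=12)
shard_sizes = [len(s) for s in shards]
```

Each resulting list of manifest names is what the service could hand to the client as one shard and later receive back together with the original filter.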
>>>>
>>>>
>>>>
>>>> There's a slight trade-off that the protocol needs to define how to
>>>> break the work into shards. I'm interested in hearing if that would work
>>>> with how you were planning on building the service on your end. Another
>>>> option is to let the service send back arbitrary JSON that would get
>>>> returned for each shard. Either way, I like that this would make it so the
>>>> service doesn't need to persist anything. We could also make it so that
>>>> small tables don't require multiple requests. For example, a client could
>>>> call the route to get file tasks with just a filter.
>>>>
>>>>
>>>>
>>>> What do you think?
>>>>
>>>>
>>>>
>>>> Ryan
>>>>
>>>>
>>>>
>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil
>>>> <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at Amazon
>>>> EMR and Athena. I’m reaching out to share a proposal for a new Scan API
>>>> that will be utilized by the RESTCatalog. The process for table scan
>>>> planning is currently done within client engines such as Apache Spark. By
>>>> moving scan functionality to the RESTCatalog, we can integrate Iceberg
>>>> table scans with external services, which can lead to several benefits.
>>>>
>>>> For example, we can leverage caching and indexes on the server side to
>>>> improve planning performance. Furthermore, by moving this scan logic to the
>>>> RESTCatalog, non-JVM engines can integrate more easily. This all can be
>>>> found in the detailed proposal below. Feel free to comment, and add your
>>>> suggestions.
>>>>
>>>> Detailed proposal:
>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>
>>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>>
>>>> Regards,
>>>>
>>>> Rahil Chertara
>>>> Amazon EMR & Athena
>>>> [email protected]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ryan Blue
>>>>
>>>> Tabular
>>>
>
> --
> Ryan Blue
> Tabular
>
