I agree with Dan. I'd rather have two endpoints than a single route whose behavior changes entirely based on an option. I don't think that a `preplan` route would be too bad.
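For concreteness, here is a sketch of what the two separate routes could look like, reusing the payload shapes from the thread below and Ryan's "pre-plan"/"plan" naming (route and field names are illustrative, not settled):

    POST /v1/namespaces/ns/tables/t/pre-plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }

    { "plan-tasks": [ { ... }, { ... } ] }

    POST /v1/namespaces/ns/tables/t/plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "plan-task": { ... } }

    { "file-scan-tasks": [ ... ] }

Each route then has exactly one behavior: "pre-plan" only distributes planning, and "plan" only returns file scan tasks, with or without a plan task to scope the work.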
On Wed, Jan 31, 2024 at 9:51 AM Daniel Weeks <[email protected]> wrote:

I agree with the opaque tokens.

However, I'm concerned we're overloading the endpoint to perform two distinctly different operations: distribute a plan and scan a plan.

Changing the task-type then changes the behavior and the result. I feel it would be more straightforward to separate the distribute and scan endpoints. Then clients can call the scan directly if they do not know how to distribute, and the behavior is clear from the REST specification.

-Dan

On Tue, Jan 30, 2024 at 9:09 PM Jack Ye <[email protected]> wrote:

+1 for having the opaque plan tasks, that's probably the most flexible way forward. And let's call them *plan tasks* going forward to standardize the terminology.

I think the name of the APIs can be determined based on the actual API shape. For example, if we centralize these 2 plan and pre-plan actions into a single API endpoint and just request different task types:

pre-plan: POST /v1/namespaces/ns/tables/t/plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
      "select": ["x", "a.b"],
      "task-type": "plan-task" }

    { "plan-tasks": [ { ... }, { ... } ] }

plan without a plan-task: POST /v1/namespaces/ns/tables/t/plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
      "select": ["x", "a.b"],
      "task-type": "file-scan-task" }   // file-scan-task should be the default type

    { "file-scan-tasks": [ { ... }, { ... } ] }

plan with a plan-task: POST /v1/namespaces/ns/tables/t/plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
      "select": ["x", "a.b"],
      "task-type": "file-scan-task",
      "plan-task": { ... } }

    { "file-scan-tasks": [...] }

In this model, we just have a single API, and we can call it something like PlanTable or PlanTableScan.

What do you think?

-Jack

On Mon, Jan 29, 2024 at 6:17 PM Renjie Liu <[email protected]> wrote:

> But to move forward, I think we should go with the option that preserves flexibility. I think the spec should state that plan tasks (if we call them that) are a JSON object that should be sent as-is back to the REST service to be used.

+1 for this.

> One more thing that I would also change is that I don't think the "plan" and "scan" endpoints make much sense. We refer to the "scan" portion of this as "planFiles" in the reference implementation, and "scan" is used for actually reading data. To be less confusing, I think that file scan tasks should be returned by a "plan" endpoint and the manifest plan tasks (or shards) should be returned by a "pre-plan" endpoint. Does anyone else like the names "pre-plan" and "plan" better?

I agree that "scan" may be quite confusing since it's actually planning the file scan. Another option I can propose is: "plan" -> "plan-table-scan", "scan" -> "plan-file-scan".
On Tue, Jan 30, 2024 at 9:03 AM Ryan Blue <[email protected]> wrote:

As you noted, the main point we still need to decide on is whether to have a standard "shard" definition (e.g. manifest plan task) or to allow it to be opaque and specific to catalogs implementing the protocol. I've not replied because I keep coming back to this decision, and I'm not sure whether the advantage is in being clear about how it works (being explicit) or in allowing implementations to differ (opaque). I'm skeptical that there will be other strategies.

But to move forward, I think we should go with the option that preserves flexibility. I think the spec should state that plan tasks (if we call them that) are a JSON object that should be sent as-is back to the REST service to be used.

One more thing that I would also change: I don't think the "plan" and "scan" endpoint names make much sense. We refer to the "scan" portion of this as "planFiles" in the reference implementation, and "scan" is used for actually reading data. To be less confusing, I think that file scan tasks should be returned by a "plan" endpoint and the manifest plan tasks (or shards) should be returned by a "pre-plan" endpoint. Does anyone else like the names "pre-plan" and "plan" better?

Ryan
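To make the opaque rule concrete, here is a sketch of what it permits; both response shapes below are hypothetical, and neither would be defined by the spec, since the client's only obligation is to echo the plan task back unchanged:

    One service might expose manifest groupings directly:

    { "plan-tasks": [ { "manifests": ["s3://bucket/metadata/m3.avro"], "start": 0, "length": 1000 } ] }

    Another might return encoded state instead:

    { "plan-tasks": [ { "token": "c25hcHNob3QtMTIzNC9zaGFyZC0w" } ] }

    Either way, the client sends the object back as-is:

    POST /v1/namespaces/ns/tables/t/plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
      "plan-task": { "token": "c25hcHNob3QtMTIzNC9zaGFyZC0w" } }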
On Mon, Jan 29, 2024 at 12:02 PM Chertara, Rahil <[email protected]> wrote:

Hi all, hope everyone is doing well.

Wanted to revive the discussion around the REST Table Scan API work. For a refresher, here is the original proposal: https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h as well as the PR: https://github.com/apache/iceberg/pull/9252

From the last messages on the thread, I believe Ryan and Jack were in favor of having two distinct API endpoints /plan and /scan, as well as a stricter JSON definition for the "shard". Here is an example from what was discussed:

POST /v1/namespaces/ns/tables/t/plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }

    { "manifest-plan-tasks": [
      { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ... }, "delete-manifests": [...] },
      { ... }
    ] }

POST /v1/namespaces/ns/tables/t/scan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
      "select": ["x", "a.b"],
      "manifest-plan-task": { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ... }, "delete-manifests": [...] } }

    { "file-scan-tasks": [...] }

POST /v1/namespaces/ns/tables/t/scan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }

    { "file-scan-tasks": [...] }

However, IIRC Micah and Renjie had some concerns around this stricter structure, as it can make the API harder to evolve in the future, along with some potential scalability challenges for larger tables that have many manifest files. (Feel free to expand further on the concerns if my understanding is incorrect.)

Would appreciate it if the community could leave any more thoughts/feedback on this thread, as well as on the Google doc and the PR.

Regards,
Rahil Chertara

On Thursday, December 21, 2023 at 10:35 PM, Renjie Liu <[email protected]> wrote (re: [EXTERNAL] Proposal for REST APIs for Iceberg table scans):

I share the same concern with Micah. The shard details should be an implementation detail of the server rather than exposed directly to the client. If the goal is to make things stateless, we just need to attach a snapshot id + shard id; then a deterministic algorithm is supposed to give the same result. Another concern is that for huge analytics tables we may have a lot of manifest files, which may lead to heavy traffic from the REST server.
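Renjie's deterministic alternative might look like the sketch below; the field names are hypothetical, and the shard-count parameter is an assumption added so that every node can derive the same shard from the same inputs without stored state:

    POST /v1/namespaces/ns/tables/t/scan

    { "snapshot-id": 1234567890,
      "shard-id": 3,
      "shard-count": 8,
      "filter": { "type": "in", "term": "x", "values": [1, 2, 3] } }

    { "file-scan-tasks": [ ... ] }

Because shard assignment is a pure function of (snapshot-id, shard-id, shard-count), any server node can answer any request, at the cost of fixing the sharding algorithm in the spec.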
On Thu, Dec 21, 2023 at 7:41 AM Micah Kornfield <[email protected]> wrote:

> Also +1 for having a more strict definition of the shard. Having arbitrary JSON is basically what we experimented with using a string shard ID, and we ended up with something very similar to the manifest plan task you describe in the serialized ID string.

If I understand the proposal correctly, I'd actually be -0.0 on the stricter structure. I think forcing a contract where it isn't strictly necessary makes it harder to evolve the system in the future. For example, it makes it harder to address potential scalability problems in a transparent way (e.g. extreme edge cases in cardinality between manifest files and delete files).

It also seems like it might overly constrain implementations (it is not clear we should need to compute the mapping from delete file manifests to data file manifests up front to start planning).

On Tue, Dec 19, 2023 at 2:10 PM Jack Ye <[email protected]> wrote:

+1 for having /plan and /scan, sounds like a good idea to separate those 2 distinct actions.

Also +1 for having a more strict definition of the shard. Having arbitrary JSON is basically what we experimented with using a string shard ID, and we ended up with something very similar to the manifest plan task you describe in the serialized ID string.

So it sounds like we are converging on the following APIs:

POST /v1/namespaces/ns/tables/t/plan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }

    { "manifest-plan-tasks": [
      { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ... }, "delete-manifests": [...] },
      { ... }
    ] }

POST /v1/namespaces/ns/tables/t/scan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
      "select": ["x", "a.b"],
      "manifest-plan-task": { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ... }, "delete-manifests": [...] } }

    { "file-scan-tasks": [...] }

POST /v1/namespaces/ns/tables/t/scan

    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }

    { "file-scan-tasks": [...] }

If this sounds good overall, we can update the prototype to have more detailed discussions in code.

-Jack

On Thu, Dec 14, 2023 at 6:10 PM Ryan Blue <[email protected]> wrote:

The tasks might look something like this:

CombinedPlanTask
- List<ManifestPlanTask>

ManifestPlanTask
- int start
- int length
- ManifestFile dataManifest
- List<ManifestFile> deleteManifests

On Thu, Dec 14, 2023 at 4:07 PM Ryan Blue <[email protected]> wrote:

> Seems like that track has expired (This Internet-Draft will expire on 13 May 2022)

Yeah, looks like we should just use POST. That's too bad. QUERY seems like a good idea to me.

> Distinguish planning using shard or not

I think this was a mistake on my part. I was still thinking that we would have a different endpoint for first-level planning to produce shards and the route to actually get files. Since both are POST requests with the same path (/v1/namespaces/ns/tables/t/scans), that no longer works. What about /v1/namespaces/ns/tables/t/scan and /v1/namespaces/ns/tables/t/plan? The latter could use some variant of planFiles, since that's what we are wrapping in the Java API.

> Necessity of scan ID

Yes, I agree. If you have shard IDs then you don't really need a scan ID. You could always have one internally but send it as part of the shard ID.

> Shape of shard payload

I think we have 2 general options, depending on how strict we want to be:

1. Require a standard shard definition
2. Allow arbitrary JSON and leave it to the service

I lean toward the first option, which would be a data manifest and the associated delete manifests for the partition. We could also extend that to a group of manifests, each with a list of delete manifests. And we could also allow splitting to ensure tasks don't get too large with big files. This all looks basically like FileScanTask, but with manifests and delete manifests.
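Side by side, the two options might produce shard payloads like these sketches (option 1 follows the ManifestPlanTask fields above; option 2 is whatever the service chooses, shown here with made-up contents):

    Option 1, a standard shard definition:

    { "manifest-plan-task": { "start": 0, "length": 1000,
        "manifest": { "path": "s3://some/manifest.avro", ... },
        "delete-manifests": [ ... ] } }

    Option 2, arbitrary service-defined JSON, echoed back by the client:

    { "shard": { "service-state": "...anything the service needs..." } }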
On Wed, Dec 13, 2023 at 4:39 PM Jack Ye <[email protected]> wrote:

Seems like that track has expired (This Internet-Draft will expire on 13 May 2022). Not sure how these RFCs are managed, but it does not seem hopeful to have this verb in. I think people are mostly using POST for this use case already.

But overall I think we are in agreement on the general direction. A few points of detail to discuss:

*Distinguish planning using shard or not*
Maybe we should add a query parameter like *distributed=true* to distinguish your first and third cases, since they now share the same signature. If the requester wants to use distributed planning, then some sharding strategy is provided as a response for the requester to send more detailed requests.

*Necessity of scan ID*
In this approach, is a scan ID still required? Because the shard payload already fully describes the information to retrieve, it seems like we can just drop the *scan-id* query parameter in the second case. It seems it's kept for the case where we still want to persist some state, but it seems like we can make a stateless style fully work.

*Shape of shard payload*
What do you think is the necessary information in the shard payload? It seems like we need at least the location of the manifests, plus the delete manifests or delete files associated with the manifests. I like the idea of making it a "shard task" that is similar to a file scan task; it might allow us to return a mixture of both types of tasks, so we can have better control of the response size.

-Jack

On Wed, Dec 13, 2023 at 3:50 PM Ryan Blue <[email protected]> wrote:

I just changed it to POST after looking into support for the QUERY method. It's a new HTTP method for cases like this where you don't want to pass everything through query params. Here's the QUERY method RFC <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>, but I guess it isn't finalized yet?

Just read them like you would a POST request that doesn't actually create anything.

On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <[email protected]> wrote:

Thanks, the Gist explains a lot of things. This is actually very close to our way of implementing the shard ID: we were defining the shard ID as a string, and the string content is actually something similar to the information in the JSON payload you showed, so we can persist minimal information in storage.

Just one clarification needed for your Gist:

> QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1
>
> { "shard": { "id": 1, "manifests": ["C"] }, "filter": { "type": "in", "term": "x", "values": [1, 2, 3] } }
>
> { "file-scan-tasks": [...] }

Here, what does this QUERY verb mean? Is that a GET? If it's GET, we cannot have a request body. That's actually why we expressed that as an ID string, since we can put it as a query parameter.

-Jack
On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <[email protected]> wrote:

Jack,

It sounds like what I'm proposing isn't quite clear, because your initial response was arguing for a sharding capability. I agree that sharding is a good idea. I'm less confident about two points:

1. Requiring that the service is stateful. As Renjie pointed out, that makes it harder to scale the service.
2. The need for both pagination *and* sharding as separate things

And I also think that Fokko has a good point about trying to keep things simple and not requiring the CreateScan endpoint.

For the first point, I'm proposing that we still have a CreateScan endpoint, but instead of sending only a list of shard IDs it can also send either a standard shard "task" or an optional JSON definition. Let's assume we can send arbitrary JSON for an example. Say I have a table with 4 manifests, A through D, and that C and D match some query filter. When I call the CreateScan endpoint, the service would send back tasks with that information: {"id": 1, "manifests": ["C"]}, {"id": 2, "manifests": ["D"]}. By sending what the shards mean (the manifests to read), my service can be stateless: any node can get a request for shard 1, read manifest C, and send back the resulting data files.

I don't see much of an argument against doing this *in principle*. It gives you the flexibility to store state if you choose, or to send state to the client for it to pass back when calling the GetTasks endpoint. There is a practical problem, which is that it's annoying to send a GET request with a JSON payload because you can't send a request body. It's probably obvious, but I'm also not a REST purist, so I'd be fine using POST or QUERY for this. It would look something like this Gist <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>.

In your last reply, you also asked whether a stateless service is a goal. I don't think that it is, but if we can make simple changes to the spec to allow more flexibility on the server side, I think that's a good direction. You also asked about a reference implementation, and I consider CatalogHandlers <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java> to be that reference. It does everything except for the work done by your choice of web application framework. It isn't stateless, but it only relies on a Catalog implementation for persistence.

For the second point, I don't understand why we need both sharding and pagination. That is, if we have a protocol that allows sharding, why is pagination also needed? From my naive perspective on how sharding would work, we should be able to use metadata from the manifest list to limit the potential number of data files in a given shard. As long as we can limit the size of a shard to produce more, pagination seems like an unnecessary complication.

Lastly, for Fokko's point, I think another easy extension to the proposal is to support a direct call to GetTasks. There's a trade-off here, but if you're already sending the original filter along with the request (in order to filter records from manifest C, for instance) then the request is already something the protocol can express. There's an objection concerning resource consumption on the service and creating responses that are too large or take too long, but we can get around that by responding with a code that instructs the client to use the CreateScan API, like 413 (Payload Too Large). I think that would allow simple clients to function for all but really large tables. The Gist above also shows what this might look like.

Ryan
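Pulling Ryan's example together, the stateless exchange sketched in the Gist would look roughly like this, with the 413 fallback for simple clients (routes and shapes are illustrative):

    CreateScan: the response carries the shard contents, so any node can serve the follow-up:

    POST /v1/namespaces/ns/tables/t/scans
    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] } }

    { "shards": [ { "id": 1, "manifests": ["C"] }, { "id": 2, "manifests": ["D"] } ] }

    GetTasks: the client sends a shard back along with the original filter:

    POST /v1/namespaces/ns/tables/t/scans/tasks
    { "shard": { "id": 1, "manifests": ["C"] }, "filter": { "type": "in", "term": "x", "values": [1, 2, 3] } }

    { "file-scan-tasks": [ ... ] }

A simple client may call GetTasks with just the filter; a service that considers the result too large can answer 413 (Payload Too Large) to redirect it to the sharded path.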
On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <[email protected]> wrote:

The current proposal definitely makes the server stateful. In our prototype we used other components like DynamoDB to keep track of state. If keeping it stateless is a tenet, we can definitely move the proposal closer to that direction. Maybe one thing to make sure of: is this a core tenet of the REST spec? Today we do not even have an official reference implementation of the REST server, so I feel it is hard to say what the core tenets are. Maybe we should create one?

Pagination is a common issue in the REST spec. We also see similar limitations with other APIs like GetTables and GetNamespaces. When a catalog has many namespaces and tables it suffers from the same issue. It is also not ideal for use cases like web browsers, since typically you display a small page of results and do not need the full list immediately. So I feel we cannot really avoid some state being kept for those use cases.

Chunked responses might be a good way to work around it. We also thought about using HTTP/2. However, these options seem not very compatible with OpenAPI. We can do some further research in this domain; we would really appreciate it if anyone with more insight and experience with OpenAPI could provide some suggestions.

-Jack

On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <[email protected]> wrote:

Hi Rahil and Jack:

Thanks for raising this.

My question is that pagination and sharding will make the REST server stateful, i.e. a sequence of calls is required to go to the same server. In this case, how do we ensure the scalability of the REST server?

On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <[email protected]> wrote:

Hey Rahil and Jack,

Thanks for bringing this up. Ryan and I also discussed this briefly in the early days of PyIceberg, and it would have helped a lot with the speed of development. We went for the traditional approach because that would also support all the other catalogs, but now that the REST catalog is taking off, I think it still makes a lot of sense to get it in.

I do share the concern raised by Ryan around the concepts of shards and pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live in a single process today, the concept of shards doesn't add value. I see your concern with long-running jobs, but for the non-distributed cases it will add additional complexity.

Some suggestions that come to mind:

- Stream the tasks directly back using a chunked response, reducing the latency to the first task (sketched below). This would also solve things with pagination. The only downside I can think of is having delete files, where you first need to make sure there are deletes relevant to the task; this might increase latency to the first task.
- Make the sharding optional. If you want to shard, you call CreateScan first and then call GetScanTask with the IDs. If you don't want to shard, you omit the shard parameter and fetch the tasks directly (here we also need to replace the scan string with the full column/expression/snapshot-id etc.).

Looking forward to discussing this tomorrow in the community sync <https://iceberg.apache.org/community/#iceberg-community-events>!

Kind regards,
Fokko
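Fokko's first suggestion could be realized on the wire as newline-delimited JSON over a chunked response; this is a hypothetical sketch of one way to do it, not something the proposal specifies:

    POST /v1/namespaces/ns/tables/t/scan
    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/x-ndjson

    { "file-scan-task": { ... } }
    { "file-scan-task": { ... } }
    ...

The client can start reading tasks as soon as the first line arrives, which removes the need for pagination but, as noted, may delay the first task when delete files must be resolved up front.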
On Mon, Dec 11, 2023 at 19:05, Jack Ye <[email protected]> wrote:

Hi Ryan, thanks for the feedback!

I was a part of this design discussion internally and can provide more details. One reason for separating the CreateScan operation was to make the API asynchronous and thus keep HTTP communications short. Consider the case where we only have the GetScanTasks API and there is no shard specified. It might take tens of seconds, or even minutes, to read through the manifest list and all the manifests before being able to return anything. This means the HTTP connection has to remain open during that period, which is not really a good practice in general (consider connection failures, load balancer and proxy load, etc.). And when we shift the API to asynchronous, it basically becomes something like the proposal, where a stateful ID is generated so the service can immediately return to the client, and the client gets results by referencing the ID. So in our current prototype implementation we are actually keeping this ID, and the whole REST service is stateful.

There were some thoughts we had about the possibility of defining a "shard ID generator" protocol: basically the client agrees with the service on a way to deterministically generate shard IDs, and the service uses it to create shards. That sounds like what you are suggesting here, and it pushes the responsibility to the client side to determine the parallelism. But in some bad cases (e.g. there are many delete files and we need to read all of those in each shard to apply filters), it seems like there might still be the long-open-connection issue above. What is your thought on that?

-Jack
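The asynchronous pattern Jack describes would look roughly like this; the routes, the scan-id field, and the polling behavior are all hypothetical shapes for the stateful prototype he mentions, not part of the spec:

    POST /v1/namespaces/ns/tables/t/scans
    { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] } }

    { "scan-id": "a1b2c3" }

    The ID is returned immediately while planning continues server-side; the client then polls:

    GET /v1/namespaces/ns/tables/t/scans/a1b2c3/tasks?shard-id=0

    { "status": "ready", "file-scan-tasks": [ ... ] }

This keeps each HTTP exchange short at the cost of server-side state, which is the trade-off the rest of the thread is weighing.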
On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <[email protected]> wrote:

Rahil, thanks for working on this. It has some really good ideas that we hadn't considered before, like a way for the service to plan how to break up the work of scan planning. I really like that idea because it makes it much easier for the service to keep memory consumption low across requests.

My primary feedback is that I think it's a little too complicated (with both sharding and pagination) and could be modified slightly so that the service doesn't need to be stateful. If the service isn't necessarily stateful then it should be easier to build implementations.

To make it possible for the service to be stateless, I'm proposing that rather than creating shard IDs that are tracked by the service, the information for a shard can be sent to the client. My assumption here is that most implementations would create shards by reading the manifest list, filtering on partition ranges, and creating a shard for some reasonable size of manifest content. For example, if a table has 100MB of metadata in 25 manifests that are about 4 MB each, then it might create 9 shards with 1-4 manifests each. The service could send those shards to the client as a list of manifests to read, and the client could send the shard information back to the service to get the data files in each shard (along with the original filter).

There's a slight trade-off in that the protocol needs to define how to break the work into shards. I'm interested in hearing whether that would work with how you were planning on building the service on your end. Another option is to let the service send back arbitrary JSON that would get returned for each shard. Either way, I like that this would make it so the service doesn't need to persist anything. We could also make it so that small tables don't require multiple requests. For example, a client could call the route to get file tasks with just a filter.

What do you think?

Ryan

On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil <[email protected]> wrote:

Hi all,

My name is Rahil Chertara, and I'm a part of the Iceberg team at Amazon EMR and Athena. I'm reaching out to share a proposal for a new Scan API that will be utilized by the RESTCatalog. The process of table scan planning is currently done within client engines such as Apache Spark. By moving scan functionality to the RESTCatalog, we can integrate Iceberg table scans with external services, which can lead to several benefits. For example, we can leverage caching and indexes on the server side to improve planning performance. Furthermore, by moving this scan logic to the RESTCatalog, non-JVM engines can integrate more easily. This can all be found in the detailed proposal below. Feel free to comment and add your suggestions.

Detailed proposal: https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h

Github POC: https://github.com/apache/iceberg/pull/9252

Regards,

Rahil Chertara
Amazon EMR & Athena
[email protected]

--
Ryan Blue
Tabular
