> evaluate a REST service and if it's good for their use case

It feels like you are talking from the perspective of people choosing a vendor product. I believe most vendors will offer near-full capability. But I am coming from the angle of small organizations that are building REST servers just to open up connectivity from internal systems to open engines, and I know there are many doing that right now. All they technically need is just a handful of APIs; many of them will be read-only and very limited in auxiliary features.
And I believe this is also the current state of Databricks Unity, which only implements the following:

- getConfig
- listNamespaces
- loadNamespaceMetadata
- listTables
- loadTable
- tableExists
- emitMetrics

Code ref: https://github.com/unitycatalog/unitycatalog/blob/main/server/src/main/java/io/unitycatalog/server/service/IcebergRestCatalogService.java

How would I express the capability of a server like that? Maybe with your team joining Databricks you will add full capabilities, but I think it is totally okay in its current state; it serves the main use cases very well. And it feels like we are optimizing for people writing 5 capabilities vs. 20 capabilities, and then we need to spend a lot of effort debating how to group capabilities going forward, when I'd rather spend that energy on other things...

-Jack

On Wed, Jun 26, 2024 at 1:41 PM Amogh Jahagirdar <amo...@apache.org> wrote: > I'm in favor of grouping by tags. The way I look at this, there are 2 > primary considerations: > > 1.) The client/server protocol complexity tradeoffs. On the first > consideration, unless I'm missing something the client side becomes > significantly more complex; if this has been sketched out earlier in this > thread just point me to it. Grouping by tag seems more easy to manage from > a client side but happy to be proven wrong here if that's not the case, > this is just going off whats in my head at the moment. > > 2.) What is more useful for end users of Iceberg to evaluate the > capabilities of a REST Server? > In the end as REST is becoming more adopted, I think it's more healthy for > the ecosystem > if the community can define groupings which end users can more easily > understand when trying to understand what a REST implementation can > actually do. To me defining "X operation/endpoint Version Y" for all > operations makes it needlessly more difficult for users to evaluate a REST > service and if it's good for their use case. Standardizing the grouping by > a simple tag name has a clear benefit to end users imo. > > > (1) There are many REST services that would only implement a very small > set > > of APIs, such as just loadTable and loadView. Some will choose to not > > implement very specific endpoints, such as renameTable. Tags seems > > convenient but it is mandating people to implement a specific group of > APIs > > together, which is a lot of burdens for especially small organizations, > if > > they just want to support very specific goals like reading through IRC. > > I can understand the concern here but going back to my earlier point, the > capability tagging really should be beneficial for end users which I think > optimizing for that is more important. It's true that for a REST server to > be considered "capability X compliant" it needs to implement all the > endpoints for X which does have a burden on a server implementation, but I > think that's net better for the ecosystem since a broader set of users have > a clear idea of what's supported and can make good decisions for themselves > since everyone is speaking the same standard language. > > Furthermore, we could also look at if it makes sense to make the tags more > granular if the scenario described is actually common. > > On 2024/06/26 16:26:29 Jack Ye wrote: > > It seems like there are 2 sub-topics here: > > 1. should we group operations with tags, or should we do this > > per-operation/endpoint? > > 2. how should we do the capability/versioning for each unit (either per > tag > > or per operation) > > > > Shall we first conclude on 1?
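For illustration only, a minimal sketch of what the per-operation option could look like from the client side; the "operation:<operationName>" key format follows Jack's proposal further down in this thread, and every name here is hypothetical, not part of the current spec:

    import java.util.Map;
    import java.util.Optional;

    // Client-side view of per-operation support advertised via getConfig.
    class OperationSupport {
      private final Map<String, String> configDefaults;

      OperationSupport(Map<String, String> configDefaults) {
        this.configDefaults = configDefaults;
      }

      // Advertised version of an operation, "*" meaning any; empty if not advertised.
      Optional<String> versionOf(String operationName) {
        return Optional.ofNullable(configDefaults.get("operation:" + operationName));
      }

      boolean supports(String operationName) {
        return versionOf(operationName).isPresent();
      }
    }

    // A small, read-only server could then advertise only, say,
    // {"operation:getConfig": "*", "operation:listTables": "1", "operation:loadTable": "1"}.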
> > > > For 1, my take is that we will need to do it per operation, for 2 > reasons: > > > > (1) There are many REST services that would only implement a very small > set > > of APIs, such as just loadTable and loadView. Some will choose to not > > implement very specific endpoints, such as renameTable. Tags seems > > convenient but it is mandating people to implement a specific group of > APIs > > together, which is a lot of burdens for especially small organizations, > if > > they just want to support very specific goals like reading through IRC. > > > > (2) Suppose a new tag is added in the future, the server returns that > tag, > > but an older client does not understand it, it might cause mistakes in > the > > client's understanding of what is supported and what is not, when a tag > > contains both features in existing APIs and also new APIs. If we define > > that tags do not overlap with each other, this is probably not a concern. > > However, (1) still is a problem from a usability perspective. > > > > Best, > > Jack Ye > > > > > > > > > > On Wed, Jun 26, 2024 at 9:02 AM Daniel Weeks <dwe...@apache.org> wrote: > > > > > I think Robert's approach is a reasonable compromise here. > > > > > > If we wanted a "per operation/endpoint" versioning, I think I'd prefer > > > Micah's OpenAPI spec based approach because it's more standardized, > but I > > > feel adds a lot of client complexity. > > > > > > -Dan > > > > > > > > > > > > On Wed, Jun 26, 2024 at 6:59 AM Robert Stupp <sn...@snazy.de> wrote: > > > > > >> (I think, compatibility deserves a separate thread - it's a "huge" > topic) > > >> > > >> Based on experience, we decided on the following with Nessie: > > >> > > >> - Unknown fields/attributes in a structure _DO_ cause > > >> (de)serialization failures. > > >> - "Stable API versions" - endpoint additions and/or added query > > >> parameters and/or enhanced structures do _NOT_ require a new API > version > > >> (as in the endpoint's route/path). > > >> - "Flexible spec versions" - new and updated "capabilities" however > > >> might cause a bump in the "spec version" that the server announces > in its > > >> `getConfig` result. > > >> > > >> Adding new routes/paths may require new endpoint implementations on > the > > >> server side, which can easily lead to a lot of (unnecessarily > boilerplate) > > >> code. Using different routes/paths is justified if the API is changed > > >> "fundamentally". We call the "path component" (api/v1/..., > api/v2/...) API > > >> version - the server indicates the minimum and maximum supported API > > >> version, in case a client wants to "upgrade". I recommend to _not_ > bump the > > >> API version in the route/path if it's not really necessary. > > >> > > >> Regarding the requirement to fail on unknown attributes: Unknown > > >> attributes may contain important information. A client may send a > newer > > >> version of a request object with an important new field, but the > (older) > > >> server discards the new attribute. Think of an attribute that for > example > > >> defines a "commit condition" that the client expects to be respected. > "New" > > >> attributes must be omittable (e.g. don't serialize if null/default) - > > >> clients indicate the "usage" of an added attribute using some request > > >> attribute (for example: "boolean returnExtendedInformation"). 
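A rough sketch of the request-evolution pattern described in the paragraph above, assuming Jackson for (de)serialization; the class and field names other than returnExtendedInformation are made up:

    import com.fasterxml.jackson.annotation.JsonInclude;

    // New attributes are omitted from the JSON when null, and a boolean flag
    // signals that the client actually relies on the added behavior.
    @JsonInclude(JsonInclude.Include.NON_NULL)
    class UpdateRequest {
      public String tableIdentifier;   // existing, always present

      // New, optional attribute (e.g. a commit condition). Because it is not
      // serialized when null, an older strict server only rejects the request
      // when the client really depends on the new behavior.
      public String commitCondition;

      public boolean returnExtendedInformation;  // client indicates usage of the addition
    }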
> > >> > > >> The list of capabilities can be indicated with included "spec > versions", > > >> to tell clients which features/functionalities a server > > >> supports."Production" spec versions could start with 1, and "reserve" > 0 for > > >> experimental/unsupported/poc kind of implementation. It could look > like > > >> this: > > >> capabilities: [ > > >> "table-spec/2,3", // but not table-spec v1 here > > >> "view-spec/1", > > >> "table-api/1", > > >> "view-api/1", > > >> "udf-api/1", > > >> "super-feature/2,4,6", // but not spec versions 0,1,3,5,7+ > > >> ... > > >> ] > > >> Incrementing a spec version in the list of capabilities doesn't break > any > > >> client. We could also define a structure to describe each capability: > > >> components: > > >> schemas: > > >> Capability: > > >> name: > > >> type: string > > >> description: Name of the capability > > >> versions: > > >> type: array: > > >> description: List of supported spec versions of this > > >> capability. 0 means experimental (non-production) without any > guarantees > > >> about the stability of schema for request and response parameters. > > >> items: > > >> type: integer > > >> format: int32 > > >> > > >> In Nessie, we ensure backwards and forwards compatibility using a > > >> specialized test suite that runs the "in tree" client against older > server > > >> versions and older client versions against the "in tree" server > version. It > > >> works fine for us for a few years now - and it did help preventing > > >> compatibility issues. > > >> > > >> > > >> On 26.06.24 07:44, Péter Váry wrote: > > >> > > >> Hi everyone, > > >> > > >> A few considerations: > > >> - I think we should explicitly state which client/service > > >> interoperability we are aiming for. I expect that we want to support > both > > >> old client -> new server, and new client -> old server communications. > > >> - I agree with Jack, that we should think about versions in advance - > HMS > > >> tried to be backwards compatible for everything, and that made it > hard to > > >> move forward / deprecate things. > > >> - Still we should try to keep the backwards incompatible changes > minimal. > > >> (All clients should be able to ignore unknown incoming fields / New > > >> optional input parameter should drive new features / Try to avoid > enums in > > >> responses where we expect changes (?)) > > >> - OTOH, it could be important for clients to know which of the > backwards > > >> compatible changes are implemented for the given server - so I would > > >> decouple the URI from the versioning. Maybe major version change > should > > >> (could) change the URI, but backwards compatible changes should be > served > > >> on the same URI, but could be identified by different minor versions. > > >> > > >> This is exciting stuff! > > >> Thanks for pushing this forward! > > >> > > >> Peter > > >> > > >> > > >> On Wed, Jun 26, 2024, 00:15 Jack Ye <yezhao...@gmail.com> wrote: > > >> > > >>> Hi everyone, > > >>> > > >>> I feel I do not see a good answer to why not just simply version each > > >>> API? When using tag, it means I have to offer capabilities per-tagged > > >>> group. However, I could for example just offer loadTable and nothing > else > > >>> in a catalog, and that should still be Iceberg REST compliant. And I > think > > >>> we need a versioning story anyway, there is no way around it. > > >>> > > >>> Here is the workflow in my mind with versioning: > > >>> > > >>> 1. 
Going forward, every time the REST catalog spec introduces any new > > >>> API endpoints or backwards incompatible changes to the existing > APIs, the > > >>> version of the specific API is incremented. So suppose the PlanTable > API is > > >>> added, this API will be at version v1. Suppose UpdateTable is > updated with > > >>> a new update type, that API will be at version v2, but PlanTable will > > >>> remain at v1. > > >>> > > >>> 2. a catalog must implement getConfig. This API is the only one that > is > > >>> required. > > >>> > > >>> 3. in getConfig, in the defaults map (it could be in some new > metadata > > >>> structure, but since we want strong backwards compatibility > guarantee, > > >>> reusing string maps seems to be the best way), server returns > key-value > > >>> pairs of: > > >>> - key: operation:<operationName> > > >>> - value: version number > > >>> > > >>> 4. the client assumes that the map is ordered, and resolves API > versions > > >>> sequentially. For example, suppose I have the following map: > > >>> > > >>> { "operation:planTable": "1", "operation:loadTable": "2" } > > >>> > > >>> Note that by "supporting", it means to return a response in a > > >>> predictable way that is compliant with the spec. It can also return > 406 > > >>> UnsupportedOperation as a way to support it. > > >>> > > >>> There is also a special version *, that means any version can work. > > >>> > > >>> 5. Backwards compatibility: suppose the client is at a higher version > > >>> than the server, then the client should always be able to understand > the > > >>> server's full list of capabilities. > > >>> > > >>> 6. Forward compatibility: suppose the client is at a lower version > than > > >>> the server, then the client should parse whatever operation it > understands, > > >>> and use the highest version it could support to execute the > operation. > > >>> Suppose the client only supports loadTable v1, then it will continue > to hit > > >>> the GET v1/namespaces/{ns}/tables/{table} route, instead of GET > > >>> v2/namespaces/{ns}/tables/{table}. The v1 route could continue to > support > > >>> the client, or it could throw 406 to indicate that this route is > deprecated > > >>> and the client needs to upgrade. > > >>> > > >>> For initial backwards compatibility, I think not returning anything > > >>> should mean that all API that the client understands are having > version *. > > >>> > > >>> What do people think of it, compared to the tag approach? > > >>> > > >>> Best, > > >>> Jack Ye > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> On Mon, Jun 24, 2024 at 1:42 PM Micah Kornfield < > emkornfi...@gmail.com> > > >>> wrote: > > >>> > > >>>> I don't have strong opinions either way here, just thought it was > worth > > >>>> raising some concerns over possible evolution here. Some responses > inline, > > >>>> but if capabilities seem to meet the requirement at hand, then it > does > > >>>> potentially seem the simplest mechanism. > > >>>> > > >>>> > > >>>> I think we also want to avoid relyance on server specific published > > >>>>> OpenAPI as they may leak other options/parameters/etc. This may > lead to > > >>>>> confusion around what the canonical spec is and make clients > incompatible > > >>>>> if they're generated off of a non-standard spec document. > > >>>> > > >>>> > > >>>> Yeah, I wasn't proposing necessarily using built in functionality > but a > > >>>> pre-scrubbed document. 
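As a small illustration of the capability/spec-version entries Robert sketches earlier in the thread (e.g. "table-spec/2,3"), a client-side parse could look roughly like this; the splitting rules are guesses, not part of any spec:

    import java.util.Arrays;
    import java.util.Set;
    import java.util.stream.Collectors;

    // "table-spec/2,3" -> name "table-spec", supported spec versions {2, 3}.
    record Capability(String name, Set<Integer> specVersions) {
      static Capability parse(String entry) {
        String[] parts = entry.split("/", 2);
        Set<Integer> versions = parts.length < 2
            ? Set.of()
            : Arrays.stream(parts[1].split(","))
                .map(String::trim)
                .map(Integer::parseInt)
                .collect(Collectors.toSet());
        return new Capability(parts[0], versions);
      }
    }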
Since there is no reference service > implementation > > >>>> for REST it seems like each implementor would need to describe the > best way > > >>>> of scrubbing there description. > > >>>> > > >>>> > > >>>> > > >>>>> @Micah this sounds to me as if the client would then have to parse > a > > >>>>> bunch of endpoints to figure out whether it's safe to e.g. call > loading a > > >>>>> view or dropping a table on the given REST server. Rather than > having a > > >>>>> dedicated endpoint we're just using the */config* endpoint to > provide > > >>>>> information about what a server supports. > > >>>> > > >>>> > > >>>> I was not suggesting multiple endpoints here, simply different > > >>>> contents for */config *I agree in the short term this does add > > >>>> complexity on the clients. But given that the canonical REST API > clients > > >>>> are being developed into the standard library, I'm not sure how > much toil > > >>>> this would cause in general. This also does not necessarily need to > called > > >>>> up-front but could be called to verify existence vs a permission > issue > > >>>> after an error was received. > > >>>> > > >>>> What round-trips did you have in mind here? > > >>>> > > >>>> > > >>>> All good points though, but I'm not aware of a standard way to > handle > > >>>>> this. > > >>>> > > >>>> > > >>>> IIUC, this sounds like a standard service description problem to me, > > >>>> the solution with capabilities appears to be one level abstraction > on top > > >>>> of this. Service discovery seems like it has been reimplemented a > few > > >>>> different times depending on the technology [1][2][3] > > >>>> > > >>>> > > >>>> I think versioning adds another level of complexity, but might be > > >>>>> necessary since I expect these will evolve to some extent and may > even > > >>>>> require hitting versioned urls. > > >>>> > > >>>> > > >>>> If there is no concrete proposal on versioning, I agree it probably > > >>>> pays to side step this. The endpoint transitioning from list of > strings to > > >>>> list of objects, would be an obvious sign to clients that they are > out of > > >>>> date. I think serving a service description(s), despite its > complexity, is > > >>>> likely the most principled way of versioning items appropriately, > but this > > >>>> definitely requires more in depth thought/design. > > >>>> > > >>>> > > >>>> Thanks, > > >>>> Micah > > >>>> > > >>>> [1] https://en.wikipedia.org/wiki/Web_Services_Description_Language > > >>>> [2] > https://en.wikipedia.org/wiki/Web_Application_Description_Language > > >>>> [3] https://developers.google.com/discovery/v1/reference/apis > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> On Mon, Jun 24, 2024 at 12:42 PM Daniel Weeks <dwe...@apache.org> > > >>>> wrote: > > >>>> > > >>>>> Hey Micah, > > >>>>> > > >>>>> I think what we're trying to achieve is strike a balance between > > >>>>> client complexity and ability to support multiple server-side > > >>>>> capabilities. One challenge we've run into is if a client > performs an > > >>>>> operation (e.g. listViews), but receives a 403 code, it's not > clear whether > > >>>>> the client doesn't have access or the server doesn't support an > endpoint > > >>>>> but isn't sending a 404 for security reasons. This is a simple > way for the > > >>>>> client to understand what it should expect from the server. > > >>>>> > > >>>>> > Another option would be just list all endpoints . . . 
and let > > >>>>> clients take appropriate actions > > >>>>> > This could be done by vending the OpenAPI spec the server > supports > > >>>>> at its own endpoint. I think this avoids the future problem of > having to > > >>>>> classify new endpoints into a specific capability. > > >>>>> > > >>>>> You're right that this would be the most complete way to handle > this, > > >>>>> but it's really complicated and may require additional "handshake" > calls > > >>>>> even for small interactions with the catalog service. I think > this puts a > > >>>>> lot of onus on the client, when what we're describing is a set of > endpoints > > >>>>> that correspond to a capability. > > >>>>> > > >>>>> I think we also want to avoid relyance on server specific published > > >>>>> OpenAPI as they may leak other options/parameters/etc. This may > lead to > > >>>>> confusion around what the canonical spec is and make clients > incompatible > > >>>>> if they're generated off of a non-standard spec document. > > >>>>> > > >>>>> All good points though, but I'm not aware of a standard way to > handle > > >>>>> this. > > >>>>> > > >>>>> I think versioning adds another level of complexity, but might be > > >>>>> necessary since I expect these will evolve to some extent and may > even > > >>>>> require hitting versioned urls. > > >>>>> > > >>>>> -Dan > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Mon, Jun 24, 2024 at 12:03 AM Eduard Tudenhöfner < > > >>>>> etudenhoef...@apache.org> wrote: > > >>>>> > > >>>>>> We had a separate discussion with Dan on the *oauth2* flag last > week > > >>>>>> and came to the same conclusion that removing the *oauth2* > > >>>>>> capability is probably the best for now. > > >>>>>> This is mainly because we can't really act on the *oauth2* > > >>>>>> capability right now, because the */tokens* endpoint is called > > >>>>>> before we hit the */config* endpoint. > > >>>>>> > > >>>>>> > Another option would be just list all endpoints (and maybe even > > >>>>>> further which operations are supported) the server actually > supports and > > >>>>>> let clients take appropriate actions (i.e. grouping could happen > on the > > >>>>>> client side). This could be done by vending the OpenAPI spec the > server > > >>>>>> supports at its own endpoint. I think this avoids the future > problem of > > >>>>>> having to classify new endpoints into a specific capability. > > >>>>>> > > >>>>>> @Micah this sounds to me as if the client would then have to > parse a > > >>>>>> bunch of endpoints to figure out whether it's safe to e.g. call > loading a > > >>>>>> view or dropping a table on the given REST server. Rather than > having a > > >>>>>> dedicated endpoint we're just using the */config* endpoint to > > >>>>>> provide information about what a server supports. > > >>>>>> > > >>>>>> Thanks > > >>>>>> Eduard > > >>>>>> > > >>>>>> On Fri, Jun 21, 2024 at 8:27 PM Ryan Blue > > >>>>>> <b...@databricks.com.invalid> <b...@databricks.com.invalid> > wrote: > > >>>>>> > > >>>>>>> Let's remove the oauth2 tag for now until we figure out how to > move > > >>>>>>> forward there. That makes sense to me. > > >>>>>>> > > >>>>>>> On Fri, Jun 21, 2024 at 9:30 AM Dmitri Bourlatchkov > > >>>>>>> <dmitri.bourlatch...@dremio.com.invalid> > > >>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote: > > >>>>>>> > > >>>>>>>> Hi Eduard, > > >>>>>>>> > > >>>>>>>> The capabilities PR looks good to me overall. I have a concern > with > > >>>>>>>> the "oauth2" tag name though. 
> > >>>>>>>> > > >>>>>>>> I also commented [1] in GH but the comment appears to be closed > by > > >>>>>>>> default :) > > >>>>>>>> > > >>>>>>>> I believe the term "oauth2" is confusing in this context with > > >>>>>>>> respect to RFC 6749 [2] as discussed in depth on another thread > [3] > > >>>>>>>> > > >>>>>>>> The functionality behind the /tokens endpoint is quite specific > to > > >>>>>>>> the Iceberg REST spec and as the other discussion highlights, > there are > > >>>>>>>> concerns with respect to OAuth2 interoperability with other > OAuth2 servers. > > >>>>>>>> > > >>>>>>>> What do you think about using a different tag name for it, for > > >>>>>>>> example "local-tokens" or "auth-tokens"? > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> Dmitri. > > >>>>>>>> > > >>>>>>>> [1] > > >>>>>>>> > https://github.com/apache/iceberg/pull/9940/files/15c769a52b85ac4deff5659978c7ffa7802612b0#r1649173934 > > >>>>>>>> [2] https://www.rfc-editor.org/rfc/rfc6749 > > >>>>>>>> [3] > > >>>>>>>> > https://lists.apache.org/thread/twk84xx7v0xy5q5tfd9x5torgr82vv50 > > >>>>>>>> > > >>>>>>>> On Thu, Jun 20, 2024 at 7:28 AM Eduard Tudenhoefner < > > >>>>>>>> etudenhoef...@apache.org> wrote: > > >>>>>>>> > > >>>>>>>>> Hey everyone, > > >>>>>>>>> > > >>>>>>>>> I'd like to bring up the discussion around describing REST > server > > >>>>>>>>> capabilities via the */config* endpoint. > > >>>>>>>>> There is PR #9940 <https://github.com/apache/iceberg/pull/9940> > that > > >>>>>>>>> describes the OpenAPI spec changes. > > >>>>>>>>> > > >>>>>>>>> Mainly we'd like to have a *capabilities* field in the > > >>>>>>>>> *ConfigResponse* that allows servers to indicate to clients > which > > >>>>>>>>> capabilities are being supported. > > >>>>>>>>> > > >>>>>>>>> So far we have the following capabilities: > > >>>>>>>>> > > >>>>>>>>> - tables > > >>>>>>>>> - views > > >>>>>>>>> - remote-signing > > >>>>>>>>> - vended-credentials > > >>>>>>>>> - multi-table-commit > > >>>>>>>>> - register-table > > >>>>>>>>> - table-metrics > > >>>>>>>>> - oauth2 > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> The general idea behind a capability is that if e.g. a server > > >>>>>>>>> supports *views*, then that server must implement all endpoints > > >>>>>>>>> grouped under that capability. > > >>>>>>>>> It's worth noting that the */config* endpoint is currently > being > > >>>>>>>>> implicit (meaning that every REST server would have to > implement it). > > >>>>>>>>> > > >>>>>>>>> One discussion point that came up during review is how we want > to > > >>>>>>>>> handle capabilities and backwards compatibility and what the > default > > >>>>>>>>> capability would be, since older servers don't know anything > about > > >>>>>>>>> *capabilities* (in such a case we could assume that the default > > >>>>>>>>> capabilities would be *oauth2* / *tables*). > > >>>>>>>>> > > >>>>>>>>> Are there any other capabilities that we'd like to include in > the > > >>>>>>>>> list? > > >>>>>>>>> > > >>>>>>>>> Eduard > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>>> -- > > >>>>>>> Ryan Blue > > >>>>>>> Databricks > > >>>>>>> > > >>>>>> -- > > >> Robert Stupp > > >> @snazy > > >> > > >> > > >
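A rough sketch of how a client might consume the capabilities field proposed above; the legacy default (tables, oauth2) for servers that predate the field and the views check follow the suggestions in this thread and are assumptions, not settled spec behavior:

    import java.util.List;
    import java.util.Set;

    class ServerCapabilities {
      private static final Set<String> LEGACY_DEFAULTS = Set.of("tables", "oauth2");

      private final Set<String> capabilities;

      ServerCapabilities(List<String> fromConfigResponse) {
        // An older server that knows nothing about capabilities returns nothing;
        // assume the minimal defaults discussed above.
        this.capabilities = fromConfigResponse == null
            ? LEGACY_DEFAULTS
            : Set.copyOf(fromConfigResponse);
      }

      boolean supportsViews() {
        // If "views" is not advertised, a 403 from a view endpoint can be
        // treated as "unsupported" rather than "no permission".
        return capabilities.contains("views");
      }
    }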