kbendick commented on pull request #3561:
URL: https://github.com/apache/iceberg/pull/3561#issuecomment-984286915
First, thank you for the extensive review @rymurr! I truly do appreciate it.
> A few general comments from me:
>
> 1. 412 feels weird. the fact that some people may have load balancers in
front and some load balancers standard config aren't clear about what 404 means
isn't a great reason for a non-standard overload of a return code.
For reference, the primary load balancers I've worked with when encountering
the want / desire to do this have been AWS provided. That's not meant as a dig
on AWS ELBs or ALBs, but just to say that it's not a niche set of load
balancers we're talking about. Amazon only advertise general 4xx and 5xx
counts, but it's pretty common to have metrics from the load balancer or
fronting servers (nginx, proxy layers) on specific codes out of the box and
it's pretty common for people to parse specific codes from ELB logs via means
that are typically outside of the normal observability stack (i.e. in-app /
in-client metrics or reported events, if any).
That's not necessarily reason to not use 404, but I did want to point out
that it likely affects most all of us. It's by no means a legacy issue or only
a concern for a small set of tools. It could just be a misconfigured server as
well. Or misconfigured DNS. A number of things can cause application metrics to
be missing (including metrics from any kind of mesh network - though you're
probably not in a good spot regardless then).
My other reason to avoid 404 is that it doesn't provide information about
what was not present - in some cases, a namespace or a specific table might not
be present from the same call and 404 for both requires additional tooling to
monitor that, especially client side. Having a dedicated code for each is
helpful there too. Using even specific codes in one response (409, 412, etc),
can indicate more rich information and so I'm partial to that and having it
documented vs a large number of 404s.
But there does seem to be pushback to this idea, so I'm willing to convert
to 404 if enough people feel strongly about it. I'm going to handle that later
as there's more feedback to cover, but will revert if people feel strongly.
> 2. Thinking about the API from a high level perspective (as opposed to a
mapping of existing java interfaces) it seems we have two entities to consider:
Catalog and Table. Catalog operations are mostly related to high level
concerns: namespaces, listings. properties, broad config etc and typically
don't require modification of iceberg metadata (therefore don't require
locking). Whereas tables require locking, modification of the filesystem/object
store etc.
I don't necessarily agree with the idea that Table requires locking and
Catalog doesn't, only because I don' think this new catalog should necessarily
prescribe how things are done either (just as the API doesn't currently fully
prescribe how a catalog can achieve certain things). Though I do see where
you're coming from and it's `Table`s that cause `Transaction`. So I don't
completely disagree by any means.
Also, in some scenarios locking might be needed on the catalog, such as
highly concurrent multi-tenant environments like Kyuubi or large Trino
deploymens etc.
I've avoided `Transaction` for now because of it in this PR. As well as
table creation. Those can be tackled separately.
> With this separation in mind (and the fact that we haven't specified in
the API how to guarantee atomicity) I thought keeping the catalog api (eg
`/v1/catalog`) to only deal w/ namespaces, properties etc may be a better
separation. We could move the CRUD operations for tables to `/v1/table` paths
and overload less the meaning of some of the operations here.
>
> Eg:
>
> 1. Post/Put/Delete/Get/Head on `v1/catalog/{namespace}` can refer to
create, update properties, delete, get properties and check namespace existence
> 2. Post/Put/Delete/Get/Head on `v1/table/{namespace}/{table}` can refer to
create, update table, delete, get table and check table existence
> Then 2. only has to deal w the atmoic CRUD operations and we override
much less.
I'm not opposed to adding more separation on the routes, but admittedly I've
moved a few of them back and forth a few times so I'll hold off momentarily and
focus on cleaning up other parts (especially types) and providing more examples
as well as working on the annotations for my own use in case we want to go that
route. I'm actually finding it slightly more effective to just work on this
document vs using the annotations, but the JSON representation of some items
isn't out of the box without an additional library.
Ryan covered some reasons to have additional string constants in the path,
which I agree with.
We'd need more routes than what you proposed still, such as removeProperties
and setProperties (on a namespace or on a table, but two separate issues). And
renameTable.
For now this PR mostly aims to cover `catalog` related issues and
non-transaction based concerns on tables (so simple gets, list, and presently
the main drop interface - RD from CRUD on tables).
Sorry for the long response, but thank you for your thorough review. I
wanted to give people enough time to respond before working on it further given
it became somewhat of a quick back and forth and then I was on vacation.
Thanks again @rymurr!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]