[GitHub] [druid] paul-rogers commented on pull request #13165: Druid Catalog basics

GitBox Fri, 28 Oct 2022 16:19:19 -0700


paul-rogers commented on PR #13165:
URL: https://github.com/apache/druid/pull/13165#issuecomment-1295634506


   > One thing I think would be good is if in the URI path, when specifying a 
table within a specific schema, the schema should come before table in the path
   
   As it turns out, the schema does come before the table:
   
   `POST /resource/tables/{schema}/{table}[?version={n}|overwrite=true|false]`
   
   I suspect the confusing bit is the "/resource/tables". The thought here was 
that, at present, the catalog has only tables. Most DBs allow user-defined 
schemas. We may want to add connections for things like S3, Kafka, etc. So, the 
thought was we'd have a variety of "resources", each with some naming 
convention. So, in the future (not now):
   
   `POST /resource/schemas/{schema}`
   `POST /resource/connections/{conn}`
   
   Etc.
   
   Here, we could simplify: the `resource` part could be removed:
   
   `POST /tables/{schema}/{table}`
   `POST /schemas/{schema}`
   `POST /connections/{conn}`
   
   (Everything here is simplified, BTW, there is a common prefix which I'm 
omitting.)
   
   > Maybe the schema can be included in the request payload
   
   That is one solution: to add a table, `POST /tables` with the name in the 
body. The problem is that the result is asymmetric on get: `GET /tables` might 
return everything, so to fetch one table one would still do `GET 
/tables/{schema}/{table}`. Plus, the content would differ: on create we provide 
the name in the body, but on get the response need not carry the name, because 
we already have it.
   
   Then, there is the update ambiguity: `POST /tables/{schema}/{name}` says 
which table we want to update. If the name also appears in the request, then we 
(and the user) would have to ensure that they match. I suppose we could do 
update as `POST /tables` with the name in the body...
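
   To make the update ambiguity concrete, here is a minimal sketch of the extra 
check the server would need if the name also appeared in the request body. The 
body field names `schema` and `name` and the helper `validate_update` are 
hypothetical, for illustration only; they are not part of the PR.

```python
def validate_update(path_schema: str, path_table: str, body: dict) -> list:
    """Return mismatch errors between the URL path and a body that
    (hypothetically) carries its own "schema" and "name" fields."""
    errors = []
    for field, expected in (("schema", path_schema), ("name", path_table)):
        # Only complain when the body names something different from the path.
        if field in body and body[field] != expected:
            errors.append(
                f"{field} in body ({body[field]!r}) does not match path ({expected!r})"
            )
    return errors
```

   With the name kept only in the path, this check (and the error case it 
guards against) goes away.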
   
   A final comment is that the present design says that the name is the place 
you store your table spec: it isn't an attribute of the spec. This means I can 
post the same spec under multiple names: one for dev, another for test, and 
another for prod. (Since Druid doesn't allow user-defined namespaces, the best 
thing is "dev_events", "test_events" and "event" for dev, test and prod.) If 
the name were in the spec, then the spec would have to be modified for each 
use. (And, the DB record would store the name twice: once in the key field and 
once in the spec, resulting in redundancy and another thing to verify on 
every update.)
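
   As a rough sketch of the "same spec, many names" idea: the spec contents, 
the `druid` schema name and the helper names below are assumptions made for 
illustration (the real spec format is not shown in this comment), and the common 
URL prefix is omitted as above.

```python
import json

# Hypothetical table spec: note that it contains no table name.
SPEC = {"description": "click events", "columns": [{"name": "__time"}]}


def table_url(schema: str, table: str) -> str:
    """Build the simplified path; the name lives only in the URL."""
    return f"/tables/{schema}/{table}"


def build_requests(schema: str, names: list, spec: dict) -> list:
    """Return (method, url, body) tuples that post one spec under many names."""
    body = json.dumps(spec, sort_keys=True)
    return [("POST", table_url(schema, name), body) for name in names]


if __name__ == "__main__":
    for method, url, body in build_requests(
        "druid", ["dev_events", "test_events", "event"], SPEC
    ):
        print(method, url, body)
```

   The request body is byte-for-byte identical for dev, test and prod; only the 
path changes.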
   
   The existing and proposed designs both allow the same spec format for 
create, update and read. They also allow the same spec to be posted to dev, 
test and prod tables. I think we want to keep each of these features. That 
said, I'm open to revisions about _how_ we provide those features.
   
   > its a bit awkward that the table is specified as a resource in some apis, 
and as a entry in others
   
   The difference in "themes" was due to the `/schemas/{schema}/{table}` / 
`/schemas/{schema}/{operation}` ambiguity if we do the obvious solution and try 
to use a common base for both. But, if we're OK with 
`/schemas/{schema}/{operation}` and `/tables/{schema}/{table}/{operation}`, 
then we can almost, but not quite, combine resources and entries.
   
   For tables:
   
   * `POST /tables/{schema}/{table}[?version={n}|overwrite=true|false]` 
add/update a table
   * `POST /tables/{schema}/{table}/edit` "edit" (incremental update) a table
   * `GET /tables/{schema}/{table}` Get the table spec for a table (same object 
as for create/update)
   * `GET /tables/{schema}/{table}/metadata` Get the table metadata (name, 
update date, state, spec, etc.)
   * `GET /tables` Get metadata for all tables in all schemas
   
   For schemas:
   
   * `POST /schemas/{schema}` Create a schema (not yet supported!)
   * `GET /schemas/{schema}` Get metadata for a schema (not yet supported!)
   * `GET /schemas/{schema}/names` Get the names of tables within the schema
   * `GET /schemas/{schema}/tables` Get the metadata for each table in the 
schema
   
   In the above, however, there is no good way to get the names of all tables 
in all schemas: `GET /schemas/names` won't work (ambiguous). `GET /schemas` 
won't work (would imply getting the metadata (contents) for all schemas).
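
   The ambiguity can be shown with a toy template matcher. This is only a 
sketch: `matches`, `candidates` and the `ROUTES` list are made up for 
illustration and do not reflect how Druid's HTTP layer actually resolves paths.

```python
def matches(template: str, path: str) -> bool:
    """True if every segment is equal or the template segment is a {param}."""
    t_parts = template.strip("/").split("/")
    p_parts = path.strip("/").split("/")
    if len(t_parts) != len(p_parts):
        return False
    return all(t.startswith("{") or t == p for t, p in zip(t_parts, p_parts))


def candidates(templates: list, path: str) -> list:
    """Return every template that could handle the path."""
    return [t for t in templates if matches(t, path)]


# Resource-first routes from the lists above, plus the problematic one.
ROUTES = [
    "/schemas/{schema}",
    "/schemas/names",
    "/schemas/{schema}/names",
    "/schemas/{schema}/tables",
]

if __name__ == "__main__":
    # Two templates claim this path: is "names" an operation or a schema name?
    print(candidates(ROUTES, "/schemas/names"))
```

   A router could break the tie by preferring literal segments, but then a 
schema that happens to be called `names` could never be addressed.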
   
   This is the "trying to be too clever" issue that made the original API a bit 
awkward: it had to do some song and dance to work around ambiguities.
   
   The proposal in the earlier message resolves these issues by saying _what 
you want to do_, then saying, _what you want to do it on_. That way, to get 
names:
   
   * `GET /names/schemas` says to get all schema names
   * `GET /names/tables` says to get all table names in all schemas
   * `GET /names/schema/{schema}` says to get all table names in the given 
schema
   
   (The above is a refinement of the earlier proposal.)
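
   A minimal sketch of the "action first, then target" dispatch for names, 
run against an in-memory stand-in for the catalog. The `CATALOG` contents, the 
`schema.table` output format and the `get_names` helper are assumptions for 
illustration, not the PR's implementation.

```python
# Hypothetical stand-in for the catalog: schema name -> table names.
CATALOG = {
    "druid": ["dev_events", "event"],
    "ext": ["kafka_input"],
}


def get_names(path: str, catalog: dict = CATALOG) -> list:
    """Resolve a GET /names/... path: the first segment says what to do
    (list names), the remaining segments say what to do it on."""
    parts = path.strip("/").split("/")
    if parts[0] != "names":
        raise ValueError(f"not a names request: {path}")
    target = parts[1:]
    if target == ["schemas"]:
        return sorted(catalog)
    if target == ["tables"]:
        return sorted(f"{s}.{t}" for s, tables in catalog.items() for t in tables)
    if len(target) == 2 and target[0] == "schema":
        return sorted(catalog[target[1]])
    raise ValueError(f"unknown target: {path}")
```

   Because the schema name only ever appears after the literal `schema` 
segment, a schema called `tables` or `schemas` does not collide with the other 
two paths.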
   
   The same pattern is then repeated for metadata with `entries`.
   
   Again, I think we need the lists of names, and the lists of contents. We 
need it for the whole system, for everything in a schema, and for a single 
table. Again, I'm open to other ways of accomplishing the goals.
   
   Ideas?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

