paul-rogers commented on pull request #2251:
URL: https://github.com/apache/drill/pull/2251#issuecomment-877884101
Let's now move on to the primary topic of this PR: plugins. Again, this is a
complex topic, especially if we think about how the solution would work at
scale. Since the design is so sparse and leaves out many details, I'm afraid
we're perhaps going down the wrong path. Let's take each topic in turn so we
can see what's what.
First, what is the structure we want? To answer this we need to separate the
*code* (Java libraries, which I'll call a "connector", following Presto) and
*configs* (the JSON things that users see, what we normally mean when we say
"plugin".)
Drill will only ever work with a fixed set of connectors: those available on
the class path at runtime. AFAIK, there has never been interest in adding
connectors at runtime (no "dynamic connectors") and, for security reasons, I'm
not sure that doing so would even be a good idea.
Connectors are presently used by any number of configs. And, every query
creates an instance of the objects in the connector (the readers, the planner
objects, etc.) So, the structure we have is:
* Config (a JSON object in the persistent store)
* Connector (a library)
* Plugin (an instance of the plugin class from a connector, along with its
config.)
* Readers, planner objects, created per-query from a plugin
The plugin registry binds connectors to configs to produce plugins (where by
"plugin" I mean an instance of the plugin class initialized with a config.)
This is some mighty complex code! It has to deal with a bunch of distributed
system and concurrency-related issues. I'll omit the details for now.
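
To make the four layers concrete, here is a toy model of the binding step. It is Python for brevity rather than Drill's Java, and every name in it (`Connector`, `Plugin`, `PluginRegistry`, `define`, `resolve`) is hypothetical, not Drill's actual API. It also ignores the distributed-system and concurrency issues entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Connector:
    """The code: one shared library per connector type (hypothetical model)."""
    type_name: str


@dataclass
class Plugin:
    """A runtime instance: one connector bound to one named config."""
    name: str
    connector: Connector
    config: dict


@dataclass
class PluginRegistry:
    """Binds stored configs to the fixed set of connectors on the class path."""
    connectors: dict                               # type name -> Connector
    configs: dict = field(default_factory=dict)    # plugin name -> config
    _plugins: dict = field(default_factory=dict)   # plugin name -> Plugin

    def define(self, name, config):
        if config["type"] not in self.connectors:
            raise KeyError(f"no connector for type {config['type']!r}")
        self.configs[name] = dict(config)
        self._plugins.pop(name, None)              # drop any stale instance

    def resolve(self, name):
        # Create the plugin instance lazily, then reuse it across queries.
        if name not in self._plugins:
            config = self.configs[name]
            connector = self.connectors[config["type"]]
            self._plugins[name] = Plugin(name, connector, config)
        return self._plugins[name]


file_connector = Connector("file")
registry = PluginRegistry(connectors={"file": file_connector})
registry.define("cp", {"type": "file", "connection": "classpath:///"})
registry.define("dfs", {"type": "file", "connection": "file:///"})
```

In this sketch `cp` and `dfs` are two configs and two plugin instances, but both resolve to the same single `Connector` object.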
So, what does this mean? For one, it means we only need *one* copy of the
connector. The "new" plugin registry is designed to handle this idea. My "class
path plugin" and your "class path plugin" use the same plugin library. I'll
assume that the answer above, that each user has their own copy of the code,
represents a misunderstanding because of Drill's horribly ambiguous names in
this area.
Next, the plugin registry holds two kinds of "plugins". First are those
configured in the UI and that can be resolved in queries. Familiar things like
`cp`, `dfs` and so on. Second are ad-hoc plugins: those created on-the-fly
based on table properties in a query, or plugin properties passed along in a
query definition. This is done so that queries work even if, right after
planning completes, someone deletes or changes the stored config: the next
query sees the new config, the executing query uses the config stored in the
query. (Else, disaster would result when some Drillbits see one config, others
another.)
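
The "config travels with the query" idea can be illustrated with a small sketch. It is Python rather than Java, and the plan layout and the function names (`plan_query`, `config_for_execution`) are assumptions for illustration only, not Drill's real plan format.

```python
import copy

# Stand-in for the persistent store of named configs.
stored_configs = {"dfs": {"type": "file", "connection": "file:///"}}


def plan_query(sql, plugin_name):
    """Planning copies the config into the plan (hypothetical plan shape)."""
    return {
        "sql": sql,
        "plugin": plugin_name,
        "config": copy.deepcopy(stored_configs[plugin_name]),
    }


def config_for_execution(plan):
    """Every Drillbit reads the config from the plan, never from the store."""
    return plan["config"]


running = plan_query("SELECT * FROM dfs.`/tmp/a.json`", "dfs")

# Someone edits the stored config right after planning completes.
stored_configs["dfs"]["connection"] = "hdfs://namenode:8020/"

later = plan_query("SELECT * FROM dfs.`/tmp/a.json`", "dfs")
```

The already-planned query keeps the config it was planned with, while the next query picks up the edited one.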
There is a system plugin that offers the system tables. This one is meant to
be shared by all users. It has no config options, but if it did, they would be
set at the system level by the admin. It makes no sense for each user to have
their own copy. (It might make sense to disallow certain system tables for some
users, but that is a different question.)
Now we have the "regular" configured plugins: a JSON config and an
associated plugin instance. The goal here is to define a new level so that
individual users have these items.
First, I'll question if the requirements are correct. Do we really want
plugins associated with each user? If you and I both use table "foo", do I have
to ask you to send me your JSON so I can create a new copy? What if you change
something? What if we have 100s of users all with copies? Making copies is a
"demo only" feature, it does not scale in production. Remember: reuse by copy
and paste is for amateurs who will throw away their solution, not for
professionals who want to minimize total costs.
Let's think of a use case. A firm has five departments with 10 people per
department. Some people in one department can see some data in another. For
example, the VP of marketing is allowed to see Sales and Dev data. The CFO can
see everything. Oh, and the employees change regularly.
This is not a new idea. How do other tools handle this? Oracle? SQL Server?
They operate at the level of schemas (databases) and tables. So, think of a
plugin (config in storage, config + connector at runtime) as a shared, named
object.
So, what we want are groups of plugins. "Marketing", "Sales", "Ops". Within
each are the configs for that set of tables. Users can be given permissions on
the whole group, or on individual plugins. Administrators or DBAs (those who
understand production systems) set up the configs. Users use those to which
they are given permission. Or, if we want to be cheap, we have a huge global
name space of plugins, and require all names to be globally unique, but you can
only see those to which you have been given permission. (How will this work
with 100 plugins and 1000 users? In the worst case that is 100 x 1000 =
100,000 distinct permissions to maintain.)
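
A rough sketch of why groups help, in Python. The counting functions are simple worst-case estimates, and the group layout, plugin names and user names are invented for illustration; nothing here reflects an existing Drill feature.

```python
def per_plugin_grants(num_plugins, num_users):
    """Worst case for a flat namespace: one grant per (user, plugin) pair."""
    return num_plugins * num_users


def group_grants(num_plugins, num_users, groups_per_user=1):
    """Each plugin is placed in one group; each user is granted a few groups."""
    return num_plugins + num_users * groups_per_user


# Hypothetical group layout: group name -> plugin names.
groups = {"Marketing": {"campaigns"}, "Sales": {"orders"}, "Dev": {"builds"}}
# Hypothetical grants: user -> groups that user may use.
grants = {"vp_marketing": {"Marketing", "Sales", "Dev"}, "dev1": {"Dev"}}


def can_use(user, plugin):
    """A user may use a plugin if any granted group contains it."""
    return any(plugin in groups[g] for g in grants.get(user, ()))
```

With one group per user, the number of records to maintain grows with plugins plus users rather than with their product.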
Users might want to create their own, individual configs not meant to be
shared. Fine. Have a "user" group that holds configs for just that user. If I
want to promote my new "foo" config, I ask an Admin to add it to the "Dev"
group so you can see it.
If the above is roughly in the ball park, we'd want to work out the details
in a detailed design. There are likely a dozen "gotchas" that the above
omitted. Who removes configs for users no longer with the organization? What
connectors is a user allowed to create configs for? How can the admin see all
configs to police users or track down odd behavior, such as "some user did
something silly that puts huge load on an external system"?
Think of security: who decides who gets permission on what? Is it done
manually? With what UI? How does that work in an organization of 10 people?
100? 1000? Since we are dealing with permissions based on roles (role-based
security), can/should we tie onto the corporate LDAP or similar system? What is
the API for that?
Configs need security keys, but those keys should not be stored as JSON in
the persistent store. (Drill has *always* done this wrong for systems other
than MFS; another reason Drill fails in production.) How are configs connected
with a "vault" to get keys, passwords, tickets and the like at runtime? On S3,
the use of temporary tokens is the right way to do permissions; no one gives
away S3 keys any more. So, how does a plugin obtain and refresh its S3 token?
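
One possible shape for the refresh question, sketched in Python. The `Token` and `RefreshingCredentials` classes and the `fetch` callback are hypothetical; a real version would call a vault or a token service, which is replaced here by a fake so the sketch runs on its own.

```python
import time
from dataclasses import dataclass


@dataclass
class Token:
    value: str
    expires_at: float   # seconds on the same clock the provider uses


class RefreshingCredentials:
    """Caches a short-lived token and refetches it shortly before expiry."""

    def __init__(self, fetch, clock=time.monotonic, margin=60.0):
        self._fetch = fetch      # callable returning a fresh Token
        self._clock = clock
        self._margin = margin    # refresh this many seconds before expiry
        self._token = None

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._token.expires_at - self._margin:
            self._token = self._fetch()
        return self._token.value


# Stand-ins for a real vault and clock.
fake_now = [0.0]
issued = []


def fake_vault_fetch():
    issued.append(f"token-{len(issued) + 1}")
    return Token(issued[-1], expires_at=fake_now[0] + 900.0)


creds = RefreshingCredentials(fake_vault_fetch, clock=lambda: fake_now[0])
first = creds.get()          # fetched on first use
fake_now[0] = 100.0
second = creds.get()         # still valid, served from the cache
fake_now[0] = 850.0          # inside the 60-second margin before expiry at 900
third = creds.get()          # refreshed
```

The point of the design is that the stored config would only name where credentials come from, while the secret itself is fetched and renewed at runtime.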
These are hard topics. Thanks for taking on this big challenge! Suggestion:
pick a small subset to start with, but ensure that subset allows us to add the
pieces needed when the solution moves into production.