paul-rogers commented on pull request #2251:
URL: https://github.com/apache/drill/pull/2251#issuecomment-877884101


   Let's now move on to the primary topic of this PR: plugins. Again, this is a 
complex topic, especially if we think about how the solution would work at 
scale. I'm afraid that, since the design is so sparse and leaves out many 
details, we're perhaps going down the wrong path. Let's take each topic in 
turn so we can see what's what.
   
   First, what is the structure we want? To answer this we need to separate the 
*code* (Java libraries, which I'll call a "connector", following Presto) and 
*configs* (the JSON things that users see, what we normally mean when we say 
"plugin".)
   
   Drill will only ever work with a fixed set of connectors: those available on 
the class path at runtime. AFAIK, there has never been interest in adding 
connectors at runtime (no "dynamic connectors") and, for security reasons, I'm 
not sure that doing so would even be a good idea.
   
   Connectors are presently used by any number of configs. And, every query 
creates an instance of the objects in the connector (the readers, the planner 
objects, etc.) So, the structure we have is:
   
   * Config (a JSON object in the persistent store)
   * Connector (a library)
     * Plugin (an instance of the plugin class from a connector, along with its 
config.)
       * Readers, planner objects, created per-query from a plugin
   
   The plugin registry binds connectors to configs to produce plugins (where by 
"plugin" I mean an instance of the plugin class initialized with a config.) 
This is some mighty complex code! It has to deal with a bunch of distributed 
system and concurrency-related issues. I'll omit the details for now.
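   
   To make the structure concrete, here is a rough sketch in Java of how a 
registry might bind connectors to configs. Every name in it 
(`PluginRegistrySketch`, `Connector`, `Config`, `Plugin`) is made up for 
illustration and is not Drill's actual API. It is single-threaded and 
deliberately ignores the distributed and concurrency issues mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not Drill's real API.
final class PluginRegistrySketch {

  /** A stored config: the connector it targets plus user-visible properties. */
  record Config(String connectorType, Map<String, String> properties) {}

  /** A plugin: connector code initialized with one named config. */
  record Plugin(String name, Config config) {}

  /** A connector: library code, loaded once, that can build plugins. */
  interface Connector {
    Plugin create(String name, Config config);
  }

  // Fixed set of connectors, as found on the class path at startup.
  private final Map<String, Connector> connectors;
  // Stands in for the persistent store of JSON configs.
  private final Map<String, Config> configs = new HashMap<>();
  // Plugin instances, built lazily on first use.
  private final Map<String, Plugin> plugins = new HashMap<>();

  PluginRegistrySketch(Map<String, Connector> connectors) {
    this.connectors = Map.copyOf(connectors);
  }

  void putConfig(String name, Config config) {
    if (!connectors.containsKey(config.connectorType())) {
      throw new IllegalArgumentException("No connector: " + config.connectorType());
    }
    configs.put(name, config);
    plugins.remove(name); // any cached instance is now stale
  }

  Plugin getPlugin(String name) {
    Config config = configs.get(name);
    if (config == null) {
      throw new IllegalArgumentException("No config: " + name);
    }
    // Many configs can share the single connector registered for their type.
    return plugins.computeIfAbsent(
        name, n -> connectors.get(config.connectorType()).create(n, config));
  }
}
```

   Note that two configs of the same type resolve to the same `Connector` 
object: one copy of the code, any number of configs.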
   
   So, what does this mean? For one, it means we only need *one* copy of the 
connector. The "new" plugin registry is designed to handle this idea. My "class 
path plugin" and your "class path plugin" use the same plugin library. I'll 
assume that the answer above, that each user has their own copy of the code, 
represents a misunderstanding because of Drill's horribly ambiguous names in 
this area.
   
   Next, the plugin registry holds two kinds of "plugins". First are those 
configured in the UI and that can be resolved in queries. Familiar things like 
`cp`, `dfs` and so on. Second are ad-hoc plugins: those created on-the-fly 
based on table properties in a query, or plugin properties passed along in a 
query definition. This is done so that queries work even if, right after 
planning completes, someone deletes or changes the stored config: the next 
query sees the new config, while the executing query uses the config stored in 
the query. (Else, disaster would result when some Drillbits see one config, 
others another.)
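   
   The "config travels with the query" idea can be sketched as follows. This 
is a toy model with hypothetical names (`ConfigSnapshotSketch`, 
`PlannedQuery`, the `connection` property), not Drill's planner or registry 
code: the plan captures an immutable copy of the config, and execution reads 
only that copy.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical names, not Drill's planner or registry classes.
final class ConfigSnapshotSketch {

  /** An immutable config; the compact constructor defensively copies the map. */
  record Config(Map<String, String> properties) {
    Config {
      properties = Map.copyOf(properties);
    }
  }

  /** A planned query carries the config it was planned against. */
  record PlannedQuery(String sql, Config configAtPlanTime) {}

  // Stands in for the persistent store of named configs.
  private final Map<String, Config> store = new HashMap<>();

  void put(String name, Config config) {
    store.put(name, config);
  }

  void delete(String name) {
    store.remove(name);
  }

  PlannedQuery plan(String sql, String pluginName) {
    Config config = store.get(pluginName);
    if (config == null) {
      throw new IllegalStateException("No config: " + pluginName);
    }
    return new PlannedQuery(sql, config);
  }

  /** Execution reads only the snapshot in the plan, never the store. */
  static String connectionFor(PlannedQuery query) {
    return query.configAtPlanTime().properties().get("connection");
  }
}
```

   A query planned before a config change keeps its original connection even 
if the stored config is later replaced or deleted; a query planned afterwards 
picks up the new one.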
   
   There is a system plugin that offers the system tables. This one is meant to 
be shared by all users. It has no config options, but if it did, they would be 
set at the system level by the admin. It makes no sense for each user to have 
their own copy. (It might make sense to disallow certain system tables for some 
users, but that is a different question.)
   
   Now we have the "regular" configured plugins: a JSON config and an 
associated plugin instance. The goal here is to define a new level so that 
individual users have these items.
   
   First, I'll question if the requirements are correct. Do we really want 
plugins associated with each user? If you and I both use table "foo", do I have 
to ask you to send me your JSON so I can create a new copy? What if you change 
something? What if we have 100s of users all with copies? Making copies is a 
"demo only" feature; it does not scale in production. Remember: reuse by copy 
and paste is for amateurs who will throw away their solution, not for 
professionals who want to minimize total costs.
   
   Let's think of a use case. A firm has five departments with 10 people per 
department. Some people in one department can see some data in another. For 
example, the VP of marketing is allowed to see Sales and Dev data. The CFO can 
see everything. Oh, and the employees change regularly.
   
   This is not a new idea. How do other tools handle this? Oracle? SQL Server? 
They operate at the level of schemas (databases) and tables. So, think of a 
plugin (config in storage, config + connector at runtime) as a shared, named 
object.
   
   So, what we want are groups of plugins. "Marketing", "Sales", "Ops". Within 
each are the configs for that set of tables. Users can be given permissions on 
the whole group, or on individual plugins. Administrators or DBAs (those who 
understand production systems) set up the configs. Users use those to which 
they are given permission. Or, if we want to be cheap, we have a huge global 
name space of plugins, and require all names to be globally unique, but you can 
only see those to which you have been given permission. (How will this work 
with 100 plugins and 1000 users? That results in 100,000 distinct permissions to 
maintain.)
   
   Users might want to create their own, individual configs not meant to be 
shared. Fine. Have a "user" group that holds configs for just that user. If I 
want to promote my new "foo" config, I ask an Admin to add it to the "Dev" 
group so you can see it.
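   
   As a rough illustration of the grouping idea, here is a minimal Java sketch. 
The class, the method names and the `user:` prefix for personal groups are all 
my own assumptions, not an existing Drill feature. For brevity it only models 
permissions on whole groups, not on individual plugins.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: hypothetical model of plugin groups, not Drill code.
final class PluginGroupsSketch {

  private final Map<String, Set<String>> pluginsByGroup = new HashMap<>();
  private final Map<String, Set<String>> groupsByUser = new HashMap<>();

  /** Assumed naming convention for the private group each user owns. */
  static String personalGroup(String user) {
    return "user:" + user;
  }

  /** Registers a plugin (config) name inside a group such as "Sales". */
  void addPlugin(String group, String plugin) {
    pluginsByGroup.computeIfAbsent(group, g -> new HashSet<>()).add(plugin);
  }

  /** Grants a user access to every plugin in a group. */
  void grant(String user, String group) {
    groupsByUser.computeIfAbsent(user, u -> new HashSet<>()).add(group);
  }

  /** A user may use a plugin in their own group or in a group granted to them. */
  boolean canUse(String user, String group, String plugin) {
    boolean member = group.equals(personalGroup(user))
        || groupsByUser.getOrDefault(user, Set.of()).contains(group);
    return member && pluginsByGroup.getOrDefault(group, Set.of()).contains(plugin);
  }
}
```

   With this shape, grants are tracked per user and group rather than per user 
and plugin, and "promoting" a config is just an admin adding its name to a 
shared group.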
   
   If the above is roughly in the ballpark, we'd want to work out the details 
in a detailed design. There are likely a dozen "gotchas" that the above 
omitted. Who removes configs for users no longer with the organization? What 
connectors is a user allowed to create configs for? How can the admin see all 
configs to police users or track down odd behavior (say, some user did 
something silly that puts a huge load on an external system)?
   
   Think of security: who decides who gets permission on what? Is it done 
manually? With what UI? How does that work in an organization of 10 people? 
100? 1000? Since we are dealing with permissions based on roles (role-based 
security), can/should we tie into the corporate LDAP or similar system? What is 
the API for that?
   
   Configs should hold security keys, but those should not be in the JSON 
stored in the persistent store. (Drill has *always* done this wrong for systems 
other than MFS; another reason Drill fails in production.) How are configs 
connected with a "vault" to get keys, passwords, tickets and the like at 
runtime? On S3, the use of temporary tokens is the right way to do permissions; 
no one gives away S3 keys any more. So, how does a plugin obtain and refresh 
its S3 token?
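   
   One possible shape for the token question, purely as a sketch: the plugin 
asks a small holder for a token each time it needs one, and the holder goes 
back to the vault shortly before expiry. The names below 
(`RefreshingTokenSketch`, `Token`, the `issuer` supplier) are hypothetical, and 
the supplier stands in for whatever vault or token service would really be 
called; no actual AWS or vault API is shown.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Illustrative only: hypothetical holder for a short-lived credential.
final class RefreshingTokenSketch {

  /** A temporary credential and the time at which it stops being valid. */
  record Token(String value, Instant expiresAt) {}

  private final Supplier<Token> issuer; // assumed stand-in for a vault or token service
  private final Duration margin;        // refresh this long before expiry
  private Token current;

  RefreshingTokenSketch(Supplier<Token> issuer, Duration margin) {
    this.issuer = issuer;
    this.margin = margin;
  }

  /** Returns a token valid at {@code now}, fetching a new one when close to expiry. */
  synchronized Token get(Instant now) {
    if (current == null || !now.isBefore(current.expiresAt().minus(margin))) {
      current = issuer.get();
    }
    return current;
  }
}
```

   The stored config would then hold only a reference to the secret (for 
example a vault path), never the secret itself. How that reference is resolved 
is exactly the kind of detail the design needs to spell out.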
   
   These are hard topics. Thanks for taking on this big challenge! Suggestion: 
pick a small subset to start with, but ensure that subset allows us to add the 
pieces needed when the solution moves into production.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
