[PR] inter-Extension dependency support (druid)

via GitHub Wed, 28 Aug 2024 13:50:13 -0700


georgew5656 opened a new pull request, #16973:
URL: https://github.com/apache/druid/pull/16973


   Allow specifying extensions between extensions without having to explicitly 
install jars in multiple places.
   
   ### Description
   Currently, in order to have extension A depend on extension B, you have to 
explicitly include extension B as a build dependency in extension A's pom.xml. 
This cause the jar for extension B to be included in extension A's directory 
and loaded as a module.
   
   This can lead to some weird behavior e.g. the following bug linked from 
https://github.com/apache/druid/pull/16929 caused by 
druid-kafka-extraction-namespace depending on druid-lookups-cached-global.
   ```
   There is logic in ExtensionsLoader.tryAdd that checks whether a module has 
already been loaded during initialization and skips it if it already has been 
loaded.
   
   This is a problem when both druid-kafka-extraction-namespace and 
druid-lookups-cached-global are specified because they both load 
NamespaceExtractionModule.
   
   If druid-kafka-extraction-namespace is specified first, both 
NamespaceExtractionModule and KafkaExtractionNamespaceModule are loaded by the 
druid-kafka-extraction-namespace classloader, and the 
druid-lookups-cached-global classloader doesn't load anything since 
NamespaceExtractionModule was already loaded. This is fine because the features 
of druid-lookups-cached-global are served through the module 
NamespaceExtractionModule being loaded in druid-kafka-extraction-namespace. 
(this is essentially the same behavior as just loading 
druid-kafka-extraction-namespace, and this is why loading both extensions in 
this order works)
   
   If druid-lookups-cached-global is specified first, NamespaceExtractionModule 
is loaded by the druid-lookups-cached-global class loader. The 
druid-kafka-extraction-namespace classloader will only load 
KafkaExtractionNamespaceModule because NamespaceExtractionModule has already 
been loaded. This is a problem because kafka lookups rely on classes bound in 
NamespaceExtractionModule that it can't access (because 
NamespaceExtractionModule is only bound in the druid-lookups-cached-global 
classloader).
   ```
   
   to get around this issue, my proposed fix is to allow chaining classloaders 
without having to actually copy jars. e.g. extension A won't actually have 
extension B's jar (we include the dependency as provided in the pom.xml). when 
extension A is loading classes, it will try to use extension B's classloader to 
find classes it can't find.
   
   #### Fixed the bug ...
   #### Renamed the class ...
   #### Added a forbidden-apis entry ...
   
   **Classloaders**
   We currently support two kinds of classloaders for each extension 
(extension-first or not). hadoop always uses the non extension-first 
classloader and other druid services check the 
druid.extensions.useExtensionClassloaderFirst property. 
   
   I made a assumption to simplify the change in addAllFromFileSystem() (the 
code in ExtensionsLoader where chained classloaders are set up). in this code 
we always use the classloader based off of 
druid.extensions.useExtensionClassloaderFirst, so if 
druid.extensions.useExtensionClassloaderFirst=true, the StandardClassLoader 
objects won't get chained classloading setup.
   
   It seems like this could be a issue with this line in HadoopTask 
(https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/common/task/HadoopTask.java#L158),
 because if druid.extensions.useExtensionClassloaderFirst=true is set, only the 
ExtensionFirstClassLoaders will have chained classloading setup, but it seems 
like all HadoopTask does is combine all the resource urls from all the 
extensions so I don't think this is a issue. plus this is the current behavior.
   
   this made it a little more annoying to implement chained classloading (see 
ExtensionLoader), but I didn't think we could deprecate this behavior so I left 
it in. I had to add a StandardClassLoader class because we were just using a 
regular UrlClassLoader object for the non extension-first classloader. In both 
StandardClassLoader and ExtensionFirstClassLoader i added the logic to check 
other extension's classloaders for classes. this is the part of the code i'd 
like some more detailed feedback on.
   
   **Extension dependency specification**
   I added a extension-dependencies.json per-extension resources file to 
specify inter-extension dependencies (see 
extensions-core/kafka-extraction-namespace/src/main/resources/extension-dependencies.json
 for a example)
   
   When loading extensions, ExtensionsLoader looks for a druid-* jar (the main 
extension code), and inspects it for a extension-dependencies.json resource 
file. If this file exists, it injects the correct chained classloaders to the 
extension classloader.
   
   I tested this with druid-kafka-extraction-namespace and 
druid-lookups-cached-global and everything seems to work okay.
   
   
   I haven't added any unit tests yet because I want to get feedback on whether 
this approach seems reasonable.
   
   #### Release note
   - support explicit extension dependencies in druid
   
   ##### Key changed/added classes in this PR
    * `ExtensionLoader`
    * `ExtensionFirstClassLoader`
    * `StandardClassLoader`
    * `PullDependencies`
   <hr>
   
   This PR has:
   
   - [X] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [X] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] inter-Extension dependency support (druid)

Reply via email to