Hi Team,

I would like to discuss the current dependency management for Python processors and possible improvements. At the moment there are two ways to specify dependencies: either at the package level, using a requirements.txt file that lists all the dependencies for the processors in that package, or inline under the processor's ProcessorDetails for standalone processor use. As per the NiFi Python Developer's Guide, a Python package in the python extensions directory would look something like this:
my-python-package/
│
├── __init__.py
├── ProcessorA.py
├── ProcessorB.py
└── requirements.txt

If we have several dependencies shared by ProcessorA and ProcessorB, it's sensible to list them in the requirements.txt file. But what if ProcessorA has additional dependencies (like dependency Y) that could add up to GBs of packages not needed by ProcessorB?

my-python-package/
│
├── __init__.py
├── ProcessorA.py     - with dependency X and Y
├── ProcessorB.py     - with dependency X
├── common_utils.py   - with dependency X
└── requirements.txt

Since we create a separate virtual environment for each processor in the flow, if we only use ProcessorB it would make sense not to install the dependencies needed only by ProcessorA. Unfortunately, at the moment, if there are multiple processors in a package, every module is loaded when creating a processor's environment, so the dependencies of all modules in the package have to be installed.

A better approach might be to keep the common dependencies for the package in requirements.txt and define the processor-specific dependencies inline. In that case the dependency installer would collect all the dependencies needed by that specific processor in the flow and install them at once.

It would also be better not to load other processors' modules when creating a virtualenv for one specific processor. Other processor class modules in a package could be identified and skipped when loading the package for a specific processor. Of course, if we have utility files used only by other processors (like a common_utils_for_A.py with dependency Y), it may be hard to tell that we do not need to load it for ProcessorB's virtualenv. But if a processor-specific dependency is only used in the processor's own class module, this issue can be avoided.

I'm curious about everyone's opinion: what would be the best approach for this use case, and how could this be improved in the future?

Regards,
Gabor
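P.S. For concreteness, here is a minimal sketch of the resolution I have in mind. This is not actual NiFi internals — the dependency names ("dep-x", "dep-y"), the PACKAGE_REQUIREMENTS constant, and the collect_dependencies helper are all hypothetical, just to illustrate how common requirements.txt entries and inline ProcessorDetails dependencies could be merged per processor:

```python
# Hypothetical sketch: common dependencies live in the package-level
# requirements.txt, processor-specific ones are declared inline under
# ProcessorDetails, and the installer merges only what the selected
# processor needs. Names are placeholders, not real packages.

PACKAGE_REQUIREMENTS = ["dep-x"]  # shared by all processors (requirements.txt)

class ProcessorA:
    class ProcessorDetails:
        dependencies = ["dep-y"]  # only ProcessorA needs dependency Y

class ProcessorB:
    class ProcessorDetails:
        dependencies = []         # ProcessorB only needs the common dep X

def collect_dependencies(processor, package_requirements):
    """Merge package-wide requirements with one processor's inline deps."""
    inline = getattr(processor.ProcessorDetails, "dependencies", [])
    seen, merged = set(), []
    for dep in list(package_requirements) + list(inline):
        if dep not in seen:       # de-duplicate, keep a stable order for pip
            seen.add(dep)
            merged.append(dep)
    return merged

print(collect_dependencies(ProcessorA, PACKAGE_REQUIREMENTS))  # ['dep-x', 'dep-y']
print(collect_dependencies(ProcessorB, PACKAGE_REQUIREMENTS))  # ['dep-x']
```

With something like this, ProcessorB's virtualenv would never see dependency Y, while ProcessorA's would get both the common and its own inline dependencies in one install step.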