Hi Team,

I would like to discuss the current dependency management for Python
processors and possible improvements. At the moment there are two ways
to specify dependencies: at the package level, using a
requirements.txt file that lists all the dependencies for the
processors in that package, or inline under the processor's
ProcessorDetails for standalone processor usage. As per the NiFi
Python Developer's Guide, a Python package in the python extensions
directory would look something like this:

my-python-package/
│
├── __init__.py
│
├── ProcessorA.py
│
├── ProcessorB.py
│
└── requirements.txt
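
For reference, the inline alternative mentioned above declares the
dependencies under the processor's ProcessorDetails. A minimal sketch
based on the Python Developer's Guide (the version and the pandas pin
are just illustrative):

from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class ProcessorA(FlowFileTransform):
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '1.0.0'
        # Inline alternative to requirements.txt:
        dependencies = ['pandas==2.1.1']

    def transform(self, context, flowfile):
        return FlowFileTransformResult(relationship='success')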

If we have several dependencies for ProcessorA and ProcessorB, it's
sensible to list them in the requirements.txt file.

But what if ProcessorA has additional dependencies (like dependency Y)
that could add up to GBs of packages not needed by ProcessorB?

my-python-package/
│
├── __init__.py
│
├── ProcessorA.py - with dependency X and Y
│
├── ProcessorB.py - with dependency X
│
├── common_utils.py - with dependency X
│
└── requirements.txt

Since we create a separate virtual environment for each processor in
the flow, if we only use ProcessorB it would make sense not to install
the dependencies needed only by ProcessorA. Unfortunately, at the
moment, if there are multiple processors in a package, every module is
loaded when creating the processor's environment, so the dependencies
of all modules in the package need to be installed.
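
To make this concrete: the problem comes from module-level imports,
which run as soon as a module is loaded. A sketch, where pandas and
torch stand in for dependencies X and Y:

# ProcessorA.py (sketch; pandas and torch stand in for X and Y)
import pandas   # dependency X, shared with ProcessorB
import torch    # dependency Y, GB-scale, only ProcessorA needs it

# These imports execute as soon as this module is loaded, so because
# every module in the package is loaded today, even ProcessorB's
# environment cannot be created without installing torch as well.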

A better approach might be to keep the common dependencies for the
package in the requirements.txt file and define the processor-specific
dependencies inline. In that case the dependency installer should
collect all the dependencies needed by that specific processor in the
flow and install them at once.
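
The installer could then merge the two sources for the processor being
instantiated. A hypothetical sketch (the ProcessorDetails.dependencies
lookup follows the convention from the developer's guide; the function
itself is made up for illustration):

from pathlib import Path

def collect_dependencies(package_dir: Path, processor_class) -> list[str]:
    """Merge the package-level requirements.txt with the processor's
    inline ProcessorDetails.dependencies (hypothetical sketch)."""
    deps = []
    req = package_dir / 'requirements.txt'
    if req.exists():
        deps.extend(line.strip()
                    for line in req.read_text().splitlines()
                    if line.strip() and not line.startswith('#'))
    details = getattr(processor_class, 'ProcessorDetails', None)
    deps.extend(getattr(details, 'dependencies', []) or [])
    # De-duplicate while preserving order, then hand the whole list to
    # a single 'pip install' invocation so everything is installed at
    # once into the processor's virtual environment.
    return list(dict.fromkeys(deps))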

It would also be better not to load other processors' modules when
creating a virtualenv for one specific processor. Other processor
class modules could be identified within a package and skipped when
loading the package for a specific processor. Of course, if we have
utility files for other processors (like a common_utils_for_A.py with
dependency Y), it may be hard to tell that we do not need to load that
file for ProcessorB's virtualenv. But if a processor-specific
dependency is only used in the processor's own class module, this
issue can be avoided.
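
One possible way to skip other processors' class modules without
importing them would be to scan the source with the ast module and
look for classes that declare a nested ProcessorDetails (the
convention from the developer's guide). A hypothetical heuristic, just
to illustrate the idea:

import ast
from pathlib import Path

def defines_other_processor(module_path: Path, wanted: str) -> bool:
    """Return True if the module defines a processor class other than
    'wanted', where a 'processor class' is any class containing a
    nested ProcessorDetails class. Hypothetical heuristic."""
    tree = ast.parse(module_path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name != wanted:
            if any(isinstance(child, ast.ClassDef)
                   and child.name == 'ProcessorDetails'
                   for child in node.body):
                return True
    return False

As noted above, this would not catch utility modules like
common_utils_for_A.py, since they do not define a processor class.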

I'm curious about everyone's opinion: what would be the best approach
for this use case, and how could this be improved in the future?

Regards,
Gabor
