Hi Team,

Recently there was a request [1] to support splitting a flow file into
multiple flow files using the Python FlowFileTransform API, which
would result in multiple outgoing flow files. A valid use case was
presented for this: "Input is a single flowfile which contains an
excel file, and output would be multiple flowfiles, where each
flowfile will contain one sheet from the excel file."

As Joe Witt commented on the ticket, the current APIs only support the
one flow file in, one flow file out model, whereas this request asks
for Python API support for a single flow file in, several flow files
out model. I think this is a good idea, and I think it could be
generalized to other types of Python processors as well.
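
To make the difference concrete, here is a rough sketch of the two
shapes. It does not use the real nifiapi module: Result, transform_one
and transform_many are hypothetical names I made up for illustration,
and the Excel workbook is modelled as a plain dict of sheet name to
bytes. The actual return type would be decided as part of the API
change.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Result:
    """Hypothetical stand-in for one outgoing flow file (not a nifiapi class)."""
    relationship: str
    contents: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


def transform_one(sheets: Dict[str, bytes]) -> Result:
    """Current model: one flow file in, exactly one flow file out."""
    merged = b"\n".join(sheets[name] for name in sorted(sheets))
    return Result("success", merged, {"sheet.count": str(len(sheets))})


def transform_many(sheets: Dict[str, bytes]) -> List[Result]:
    """Requested model: one flow file in, one outgoing flow file per sheet."""
    return [
        Result("success", data, {"sheet.name": name})
        for name, data in sorted(sheets.items())
    ]


# Assumed input: a workbook with two sheets, already parsed into bytes.
workbook = {"Q1": b"a,b\n1,2", "Q2": b"a,b\n3,4"}
single = transform_one(workbook)
many = transform_many(workbook)
print(single.attributes["sheet.count"], [r.attributes["sheet.name"] for r in many])
```

Returning a list keeps the single-output case as a list with one
element, so existing processors would map onto the new shape easily.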

There was a merged PR [2] to support source Python processors, and I
think we should support multiple flow file outputs for source
processors too. There could be use cases like the ListenTCP processor,
or any polling processor that periodically checks a queue and creates
flow files from all the new entries since the last trigger. A source
processor could be written to return multiple records in a single
flow file that is then split with the SplitRecord processor, but
that is more of a workaround than a solution.

With the polling type of processor mentioned above, there could be
triggers when no new entries are available at all, so no flow file can
be generated. Because of this, I also suggested a change to the API to
allow returning no new flow files in a trigger [3]. We may also
consider adding the option to yield for some time in this case.
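
As a rough sketch of what a polling source could look like with these
changes: CreateOutput, poll_queue and the yield_seconds field below
are hypothetical names, not part of the existing API, and the queue is
just an in-memory deque standing in for whatever the processor polls.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class CreateOutput:
    """Hypothetical result of one source trigger (not an existing NiFi class)."""
    flow_files: List[bytes]
    yield_seconds: Optional[float] = None  # hypothetical "yield" hint


def poll_queue(queue: Deque[bytes], idle_yield: float = 5.0) -> CreateOutput:
    """Drain every entry that arrived since the last trigger."""
    entries = []
    while queue:
        entries.append(queue.popleft())
    if not entries:
        # Nothing new: emit no flow files and ask to yield for a while.
        return CreateOutput([], yield_seconds=idle_yield)
    # One outgoing flow file per new entry, no yield requested.
    return CreateOutput(entries)


pending = deque([b"entry-1", b"entry-2"])
first = poll_queue(pending)   # two entries waiting
second = poll_queue(pending)  # queue is now empty
print(len(first.flow_files), len(second.flow_files), second.yield_seconds)
```

Whether the yield should be expressed in the return value like this or
through the context object is exactly the kind of detail I would like
feedback on.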

So I have a few questions for the community:

1. Do you agree to add support for multiple flow file outputs in the
Python API for both transform and source processors?
2. Do you agree to add support for returning no flow files from
source processors?
3. Do you think we should add an option to yield when no output flow
files are returned, or would that complicate the API too much for a
user?

I also think these changes should be implemented before the NiFi 2.0 release.

I talked with Peter Gyori, and he said he had already started working
on the "no output" feature and would be happy to work on the multiple
flow file output change as well. I would also be happy to help him
and to port these changes to the MiNiFi C++ side.

Feel free to comment with any request or requirement on the related API change.

Regards,
Gabor

[1] https://issues.apache.org/jira/browse/NIFI-13402
[2] https://github.com/apache/nifi/pull/9000
[3] https://issues.apache.org/jira/browse/NIFI-13604
