Good points raised. My thinking was that with exceptions and causing a
bigger issue at runtime, hopefully Python processor developers would be
more likely to notice any issues during development, and the issues
wouldn't ever reach production. But the flip side is that there is a
risk of an unforeseen failure scenario happening in prod, in which case
being stuck in a rollback loop would be detrimental.
I'm sure Mark is much better at judging which approach would better suit
the project goals. From the discussion it seems like the only feasible
solution is logging an error and routing the original flow file with the
original contents to failure, possibly with a few added attributes. I
hope an error log is hopefully loud enough for it to be noticed at
development time.
Responding to a few comments:
> the idea of significant API changes in order to make such a thing
difficult is probably far beyond the scope of this thread
I'm open to such solutions, but with the 2.0 GA coming soon, we don't
have much time left to break the API. I'd keep the current changes
small, and not API-breaking, but if there is community interest, I'm
happy to take part in discussions about breaking changes.
> What if you return the input payload/FlowFile Content along with the
‘failure’ relationship?
I think in the case of failure, we should ignore the returned contents,
even if they are identical to the input flow file contents, and log the
error just like as if it was a different content.
> What happens if an Exception is thrown?
I think for this issue, we're going with logging, but if an exception is
thrown in any case, I think we should catch it on the Python side, and
translate it to a route to failure before it reaches the Java side. I
would log an error, possibly including the Python stack trace, and maybe
add the error message to a flow file attribute e.g. 'error.message'.
> I also think that even though it is documented, it is not restricted
to transfer a new flow file to the original relationship. That should
also be logged and removed from the session.
I like the idea of having a similar error log when the user tries to
manually transfer the flow file to 'original'. And we should probably
discard the incorrect transfer, and keep the automatic transfer of the
original that the framework is doing.
Thanks for the inputs,
Marton
On 8/5/24 17:07, Gábor Gyimesi wrote:
I was thinking the same that first I also wanted to throw an
exception, but that would just trigger a rollback and we would retry
the flow file later. Still it is reasonable to at least write an error
message in the logs about the bad usage.
I also think that even though it is documented, it is not restricted
to transfer a new flow file to the original relationship. That should
also be logged and removed from the session. I opened a pull request
for these changes in MiNiFi C++, maybe it could also be implemented in
NiFi [1].
[1] https://github.com/apache/nifi-minifi-cpp/pull/1852
On Mon, 5 Aug 2024 at 15:25, Mark Payne <marka...@hotmail.com> wrote:
Marton,
Sure, the idea of “an API should be easy to use correctly and hard to use
incorrectly” is certainly a great principle in API design.
But I’m not entirely sure that it applies here directly. The idea of “make it
hard to use incorrectly” doesn’t mean to throw an Exception
when used incorrectly but rather to make the API such that you’d never even
return an object that is invalid. But the idea of significant
API changes in order to make such a thing difficult is probably far beyond the
scope of this thread.
That said, we can consider throwing an Exception if returning a different
payload along with the ‘failure’ relationship. But then
we would need to consider several different concepts, at a higher application
level:
1. What if you return the input payload/FlowFile Content along with the
‘failure’ relationship? Should it be a failure? Or only if the content has
changed? If considering only the API then it would make sense that returning
the same content would be okay (though inefficient). But that would mean also
doing some sort of hash on the returned content and the original content to
ensure that it’s the same, etc. which is probably a very bad idea due to
performance concerns.
2. What happens if an Exception is thrown? How does the Exception get handled?
We have two options:
- Catch the Exception in the Processor on the Java side; in this case we should
handle it like we do other failures: log an error and route the original
FlowFile to the ‘failure’ relationship. Now, we’ve not really accomplished
anything because the net result is the same - the original FlowFile goes to
failure. But we’ve wasted a lot of resources and complicated the codebase to do
so.
- Let the Exception fly. Don’t bother catching it at all on the Java side,
unless the framework ultimately catches it. The framework will then log an
Exception and rollback the session, putting the data back into the queue. Given
that the data was intended to be routed to ‘failure’ and this is a
Transformation type of Processor (given that we currently only support
transformations in python) the transformation will fail again. And again. And
the data will just stay in the queue forever, continually failing. While this
behavior of the framework makes sense, it is a “last ditch effort” to keep
things in line and generally can be avoided if there’s a better option. And
given that we already know the developer of the Processor wants to route to
failure, I’d say the better option is to route the original FlowFile to failure
to begin with.
Thanks
-Mark
On Aug 1, 2024, at 8:37 AM, Marton Szasz <sza...@apache.org> wrote:
I think good APIs are easy to use correctly and hard to use incorrectly. (I
stole this from C++ books.)
For this reason, I'm +1 for throwing an exception when the API user specifies a
content with a transform result headed to failure. Also +1 for making this
clear in the docs.
If an exception is not possible or feasible, an error log works, too. But we
should make it as loud as feasible, so the user notices the programming error
early.
Thanks,
Marton
On 7/31/24 19:52, Mark Payne wrote:
Agreed, it should absolutely be documented. I don’t have a particularly strong
opinion regarding whether or not to log a warning.
Thanks
-Mark
On Jul 31, 2024, at 1:00 PM, Gábor Gyimesi <gamezb...@gmail.com> wrote:
Hi,
MiNiFi C++ copied the behavior of the NiFi Python API to be compatible
with the NiFi python processors (although some minor differences may
apply [1]), so it works the same way in this case.
I think this behavior is understandable and can be accepted, but I
think Ferenc has a point that this at least should be documented in
the developer guide. The interface of the FlowFileTransformResult
suggests that the content can be specified for any relationship, but
it is discarded in this case without any notice. I suppose it would be
an overkill and not backwards compatible to change the API for adding
a failure specific transfer, but I think besides documenting this
behavior we could add a warning in the logs (or an exception?) if
someone specifies any content for this relationship.
Regards,
Gábor
[1]
https://github.com/apache/nifi-minifi-cpp/blob/main/extensions/python/PYTHON.md#using-nifi-python-processors
On Wed, 31 Jul 2024 at 18:24, Mark Payne <marka...@hotmail.com> wrote:
Hi Ferenc,
I’m not overly familiar with MiNiFi C++ but this is how it works in traditional
NiFi as well.
This is intended. While Processors are generally free to treat relationships as
they wish,
the convention is always to route the original incoming FlowFile to ‘failure’,
never a modified version of it.
At least with respect to content. Some attributes might be added, for example
to explain why there was a failure,
but the content should not be changed.
In the Python API we’ve generally moved toward being more prescriptive about
some of these things.
So while it’s possible to do so in Java, it should not be done. In Python, we
don’t even allow it.
Thanks
-Mark
On Jul 31, 2024, at 10:46 AM, Ferenc Gerlits <fgerl...@apache.org> wrote:
Hello NiFi community,
We have implemented NiFi's FlowFileTransform API for Python processors
in MiNiFi C++, and we have written a few processors using it. We have
noticed that the "failure" relationship works like this: if the
transform() method returns
FlowFileTransformResult(relationship="failure",
contents=some_contents, attributes=some_attributes), then
- some_contents are discarded,
- some_attributes are merged with the attributes of the original flow file,
- and then the original flow file with the merged attributes is
routed to the failure relationship. [1]
The fact that the contents are discarded was unexpected to me, and it
is not documented in the developer guide [2]. We could add this to the
developer guide, but I think it would make more sense if the failure
result worked in the same way as the success result: if contents is
None, then we use the original contents, but if contents is not None,
then it becomes the new contents of the flow file which is routed to
the failure relationship.
What do you think?
Ferenc
[1]
https://github.com/apache/nifi/blob/main/nifi-extension-bundles/nifi-py4j-bundle/nifi-py4j-bridge/src/main/java/org/apache/nifi/python/processor/FlowFileTransformProxy.java
[2]
https://github.com/apache/nifi/blob/main/nifi-docs/src/main/asciidoc/python-developer-guide.adoc