Re: Handling of the contents of the failure result in the python FlowFileTransform API

Marton Szasz Mon, 05 Aug 2024 08:54:03 -0700

Good points raised. My thinking was that with exceptions and causing abigger issue at runtime, hopefully Python processor developers would bemore likely to notice any issues during development, and the issueswouldn't ever reach production. But the flip side is that there is arisk of an unforeseen failure scenario happening in prod, in which casebeing stuck in a rollback loop would be detrimental.

I'm sure Mark is much better at judging which approach would better suitthe project goals. From the discussion it seems like the only feasiblesolution is logging an error and routing the original flow file with theoriginal contents to failure, possibly with a few added attributes. Ihope an error log is hopefully loud enough for it to be noticed atdevelopment time.


Responding to a few comments:

> the idea of significant API changes in order to make such a thingdifficult is probably far beyond the scope of this threadI'm open to such solutions, but with the 2.0 GA coming soon, we don'thave much time left to break the API. I'd keep the current changessmall, and not API-breaking, but if there is community interest, I'mhappy to take part in discussions about breaking changes.

> What if you return the input payload/FlowFile Content along with the‘failure’ relationship?I think in the case of failure, we should ignore the returned contents,even if they are identical to the input flow file contents, and log theerror just like as if it was a different content.


> What happens if an Exception is thrown?

I think for this issue, we're going with logging, but if an exception isthrown in any case, I think we should catch it on the Python side, andtranslate it to a route to failure before it reaches the Java side. Iwould log an error, possibly including the Python stack trace, and maybeadd the error message to a flow file attribute e.g. 'error.message'.

> I also think that even though it is documented, it is not restrictedto transfer a new flow file to the original relationship. That shouldalso be logged and removed from the session.I like the idea of having a similar error log when the user tries tomanually transfer the flow file to 'original'. And we should probablydiscard the incorrect transfer, and keep the automatic transfer of theoriginal that the framework is doing.


Thanks for the inputs,
Marton

On 8/5/24 17:07, Gábor Gyimesi wrote:

I was thinking the same that first I also wanted to throw an
exception, but that would just trigger a rollback and we would retry
the flow file later. Still it is reasonable to at least write an error
message in the logs about the bad usage.

I also think that even though it is documented, it is not restricted
to transfer a new flow file to the original relationship. That should
also be logged and removed from the session. I opened a pull request
for these changes in MiNiFi C++, maybe it could also be implemented in
NiFi [1].

[1] https://github.com/apache/nifi-minifi-cpp/pull/1852

On Mon, 5 Aug 2024 at 15:25, Mark Payne <marka...@hotmail.com> wrote:

Marton,

Sure, the idea of “an API should be easy to use correctly and hard to use 
incorrectly” is certainly a great principle in API design.
But I’m not entirely sure that it applies here directly. The idea of “make it 
hard to use incorrectly” doesn’t mean to throw an Exception
when used incorrectly but rather to make the API such that you’d never even 
return an object that is invalid. But the idea of significant
API changes in order to make such a thing difficult is probably far beyond the 
scope of this thread.

That said, we can consider throwing an Exception if returning a different 
payload along with the ‘failure’ relationship. But then
we would need to consider several different concepts, at a higher application 
level:

1. What if you return the input payload/FlowFile Content along with the 
‘failure’ relationship? Should it be a failure? Or only if the content has 
changed? If considering only the API then it would make sense that returning 
the same content would be okay (though inefficient). But that would mean also 
doing some sort of hash on the returned content and the original content to 
ensure that it’s the same, etc. which is probably a very bad idea due to 
performance concerns.

2. What happens if an Exception is thrown? How does the Exception get handled? 
We have two options:

- Catch the Exception in the Processor on the Java side; in this case we should 
handle it like we do other failures: log an error and route the original 
FlowFile to the ‘failure’ relationship. Now, we’ve not really accomplished 
anything because the net result is the same - the original FlowFile goes to 
failure. But we’ve wasted a lot of resources and complicated the codebase to do 
so.

- Let the Exception fly. Don’t bother catching it at all on the Java side, 
unless the framework ultimately catches it. The framework will then log an 
Exception and rollback the session, putting the data back into the queue. Given 
that the data was intended to be routed to ‘failure’ and this is a 
Transformation type of Processor (given that we currently only support 
transformations in python) the transformation will fail again. And again. And 
the data will just stay in the queue forever, continually failing. While this 
behavior of the framework makes sense, it is a “last ditch effort” to keep 
things in line and generally can be avoided if there’s a better option. And 
given that we already know the developer of the Processor wants to route to 
failure, I’d say the better option is to route the original FlowFile to failure 
to begin with.

Thanks
-Mark

On Aug 1, 2024, at 8:37 AM, Marton Szasz <sza...@apache.org> wrote:

I think good APIs are easy to use correctly and hard to use incorrectly. (I 
stole this from C++ books.)

For this reason, I'm +1 for throwing an exception when the API user specifies a 
content with a transform result headed to failure. Also +1 for making this 
clear in the docs.

If an exception is not possible or feasible, an error log works, too. But we 
should make it as loud as feasible, so the user notices the programming error 
early.

Thanks,
Marton

On 7/31/24 19:52, Mark Payne wrote:

Agreed, it should absolutely be documented. I don’t have a particularly strong 
opinion regarding whether or not to log a warning.

Thanks
-Mark

On Jul 31, 2024, at 1:00 PM, Gábor Gyimesi <gamezb...@gmail.com> wrote:

Hi,

MiNiFi C++ copied the behavior of the NiFi Python API to be compatible
with the NiFi python processors (although some minor differences may
apply [1]), so it works the same way in this case.
I think this behavior is understandable and can be accepted, but I
think Ferenc has a point that this at least should be documented in
the developer guide. The interface of the FlowFileTransformResult
suggests that the content can be specified for any relationship, but
it is discarded in this case without any notice. I suppose it would be
an overkill and not backwards compatible to change the API for adding
a failure specific transfer, but I think besides documenting this
behavior we could add a warning in the logs (or an exception?) if
someone specifies any content for this relationship.

Regards,
Gábor

[1] 
https://github.com/apache/nifi-minifi-cpp/blob/main/extensions/python/PYTHON.md#using-nifi-python-processors

On Wed, 31 Jul 2024 at 18:24, Mark Payne <marka...@hotmail.com> wrote:

Hi Ferenc,

I’m not overly familiar with MiNiFi C++ but this is how it works in traditional 
NiFi as well.
This is intended. While Processors are generally free to treat relationships as 
they wish,
the convention is always to route the original incoming FlowFile to ‘failure’, 
never a modified version of it.
At least with respect to content. Some attributes might be added, for example 
to explain why there was a failure,
but the content should not be changed.

In the Python API we’ve generally moved toward being more prescriptive about 
some of these things.
So while it’s possible to do so in Java, it should not be done. In Python, we 
don’t even allow it.

Thanks
-Mark

On Jul 31, 2024, at 10:46 AM, Ferenc Gerlits <fgerl...@apache.org> wrote:

Hello NiFi community,

We have implemented NiFi's FlowFileTransform API for Python processors
in MiNiFi C++, and we have written a few processors using it. We have
noticed that the "failure" relationship works like this: if the
transform() method returns
FlowFileTransformResult(relationship="failure",
contents=some_contents, attributes=some_attributes), then
- some_contents are discarded,
- some_attributes are merged with the attributes of the original flow file,
- and then the original flow file with the merged attributes is
routed to the failure relationship. [1]

The fact that the contents are discarded was unexpected to me, and it
is not documented in the developer guide [2]. We could add this to the
developer guide, but I think it would make more sense if the failure
result worked in the same way as the success result: if contents is
None, then we use the original contents, but if contents is not None,
then it becomes the new contents of the flow file which is routed to
the failure relationship.

What do you think?
Ferenc

[1] 
https://github.com/apache/nifi/blob/main/nifi-extension-bundles/nifi-py4j-bundle/nifi-py4j-bridge/src/main/java/org/apache/nifi/python/processor/FlowFileTransformProxy.java
[2] 
https://github.com/apache/nifi/blob/main/nifi-docs/src/main/asciidoc/python-developer-guide.adoc

Re: Handling of the contents of the failure result in the python FlowFileTransform API

Reply via email to