[ 
https://issues.apache.org/jira/browse/NIFI-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007318#comment-18007318
 ] 

David Handermann commented on NIFI-14753:
-----------------------------------------

Thanks for summarizing the current issue and describing some options.

The Context section mentions a ByteArrayOutputStream. Is that in reference to a 
test scenario?

To the substance of the issue, there are several things to consider in terms of 
expected behavior. Schema inference is an inherently best-effort approach, and 
will always be subject to problems when newer data does not match older schema 
definitions. If individual records can vary that significantly, then schema 
evaluation should be deferred, and consuming from a stream should be limited to 
preserve the original record content, which is one of the existing 
configuration options.

If the expected schema is within certain boundaries, then a predefined schema 
with optional types and fields is one way to go. The other option is much more 
lenient schema inference. Making some adjustments around the edges, such as 
using {{long}} for any integral number, may help in scenarios like the one 
described.
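
As a rough illustration of the {{long}} adjustment, here is a minimal sketch 
in plain Java. It is not NiFi code: the class name, the enum, and the 
{{infer}} method are hypothetical and stand in for whatever the inference 
implementation actually uses. The only point is that no integral value is 
ever inferred as {{int}}.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of NiFi: widens every integral value to long
// during inference so that a later, larger value still matches the schema.
final class LenientIntegralInference {

    // Hypothetical type labels used only for this sketch.
    enum InferredType { LONG, BIGINT, DOUBLE, STRING }

    static InferredType infer(final Object value) {
        // Every integral primitive wrapper is treated as long, never int.
        if (value instanceof Byte || value instanceof Short
                || value instanceof Integer || value instanceof Long) {
            return InferredType.LONG;
        }
        if (value instanceof BigInteger) {
            // bitLength() excludes the sign bit, so fewer than 64 bits fits a long.
            final BigInteger big = (BigInteger) value;
            return big.bitLength() < Long.SIZE ? InferredType.LONG : InferredType.BIGINT;
        }
        if (value instanceof Number) {
            return InferredType.DOUBLE;
        }
        return InferredType.STRING;
    }
}
```

With inference like this, a first record holding 1 and a later record holding 
Long.MAX_VALUE would both map to the same field type.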

Beyond improving schema handling, various types of exceptions should trigger 
some type of failure behavior. That is the purpose of the parse.failure 
relationship, so routing to that relationship may be the best option for this 
scenario. Routing to some new relationship is another approach to consider 
depending on the Processor.

In any exception case, the solution is not to avoid writing to a FlowFile 
output stream, but to write the applicable records to a failure FlowFile 
instead.
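
To make the routing idea concrete, here is a simplified sketch in plain Java. 
It does not use the NiFi API: the two ByteArrayOutputStream fields stand in 
for the success and failure FlowFiles, {{toJson}} is a hypothetical stand-in 
for a writer bound to a schema with one int field, and 
IllegalArgumentException stands in for IllegalTypeConversionException. For 
brevity the sketch builds each record as a string before appending it, which 
is an assumption of the example and not a statement about how a Processor 
should be structured.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch, not NiFi API: records that cannot be converted go to a
// separate failure output instead of being dropped or half-written.
final class FailureRoutingSketch {

    // Stand-ins for the content of a success FlowFile and a failure FlowFile.
    final ByteArrayOutputStream success = new ByteArrayOutputStream();
    final ByteArrayOutputStream failure = new ByteArrayOutputStream();

    // Stand-in for writing against a schema with a single int field.
    static String toJson(final String field, final long value) {
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("Value does not fit int: " + value);
        }
        return "{\"" + field + "\":" + value + "}";
    }

    void route(final String field, final List<Long> values) {
        for (final long value : values) {
            try {
                append(success, toJson(field, value));
            } catch (final IllegalArgumentException e) {
                // Keep the original value so the failure output can be inspected.
                append(failure, field + "=" + value);
            }
        }
    }

    private static void append(final ByteArrayOutputStream out, final String line) {
        final byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }
}
```

In a real Processor the failure output would correspond to a FlowFile 
transferred to parse.failure or a similar relationship.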

At this point, I'm not seeing any initial need for changes to the 
WriteJsonResult class. As mentioned in the discussion for NIFI-14696, other 
Record Writers have the option to implement different strategies for the level 
of leniency in schema handling, but the current JSON Record Writer looks like 
it should retain existing behavior in general.

If there are other failure scenarios similar to the one described, those may 
also be helpful to consider. In these cases, however, evaluating the larger 
context of options is the way to go.

> WriteJsonResult leaves the ByteArrayOutputStream in inconsistent state after 
> trying to write an incompatible schema record
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14753
>                 URL: https://issues.apache.org/jira/browse/NIFI-14753
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 2.4.0
>            Reporter: Dariusz Seweryn
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h1. Context
> Consider a WriteJsonResult for schema ["Int Field Name": INT].
> Writing a record with ["Int Field Name": Long.MAX_VALUE] throws an 
> IllegalTypeConversionException, which is expected. Unfortunately, the 
> ByteArrayOutputStream (BAOS) is then left with content that is not valid 
> JSON:
> { "Int Field Name" }
> See [this 
> discussion|https://github.com/apache/nifi/pull/10053#discussion_r2190950596].
> h1. IllegalTypeConversionException handling
> Apart from handling the offending record in the processor that tried to write 
> it, this situation has the following consequence for exception handling:
> The same malformed record will most likely surface or cause a warning log in a 
> following processor, which can be surprising without knowledge of this 
> behavior in WriteJsonResult (unless all other processors assume that 
> malformed records may appear in consumed FlowFiles and silently drop such 
> records). {_}This can lead to wasted effort for people triaging such 
> warning logs{_}.
> h1. Ideal solution
> In case of an IllegalTypeConversionException, WriteJsonResult does not write 
> any data to the BAOS.
> Such an approach would need to buffer all writes to the FlowFile until the 
> full record is processed, executing them afterwards. This may impact 
> performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
