[
https://issues.apache.org/jira/browse/NIFI-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007318#comment-18007318
]
David Handermann commented on NIFI-14753:
-----------------------------------------
Thanks for summarizing the current issue and describing some options.
The Context section mentions a ByteArrayOutputStream. Is that in reference to a
test scenario?
To the substance of the issue, there are several things to consider in terms of
expected behavior. Schema inference is an inherently best-effort approach, and
will always be subject to problems when newer data does not match older schema
definitions. If individual records can vary that significantly, then schema
evaluation should be deferred, and consuming from a stream should be limited to
preserving the original record content, which is one of the existing
configuration options.
If the expected schema is within certain boundaries, then a predefined schema
with optional types and fields is one way to go. The other is a much more
lenient approach to schema inference. Making some adjustments around the
edges, such as using {{long}} for any integral number, may help in scenarios
like the one described.
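As a rough illustration of the {{long}} adjustment, the following Java sketch widens every integral value to a single wide type during inference. The class, enum, and method names are hypothetical and simplified; this is not NiFi's schema inference code.

```java
import java.math.BigInteger;

public class LenientIntegralInference {

    /** Hypothetical, simplified set of field types; not NiFi's RecordFieldType. */
    public enum FieldType { LONG, BIGINT, DOUBLE, STRING }

    /**
     * Infers a field type from a single sample value. Every integral value that
     * fits in 64 bits is reported as LONG, so a later record holding a larger
     * integer does not conflict with a narrower type inferred earlier.
     */
    public static FieldType inferType(Object value) {
        if (value instanceof Byte || value instanceof Short
                || value instanceof Integer || value instanceof Long) {
            return FieldType.LONG;
        }
        if (value instanceof BigInteger) {
            // bitLength() excludes the sign bit, so anything below 64 fits in a long.
            BigInteger big = (BigInteger) value;
            return big.bitLength() < 64 ? FieldType.LONG : FieldType.BIGINT;
        }
        if (value instanceof Number) {
            // Simplification for this sketch: all other numbers are treated as DOUBLE.
            return FieldType.DOUBLE;
        }
        // Simplification for this sketch: null and non-numeric values fall back to STRING.
        return FieldType.STRING;
    }
}
```

With this kind of widening, a schema inferred from a record containing 1 would already accept Long.MAX_VALUE in the same field.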
Beyond improving schema handling, various types of exceptions should trigger
some type of failure behavior. That is the purpose of the parse.failure
relationship, so routing to that relationship may be the best option for this
scenario. Routing to some new relationship is another approach to consider
depending on the Processor.
In any exception case, the solution is not attempting to avoid writing to a
FlowFile output stream, but to write applicable records to a failure FlowFile
instead.
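A minimal Java sketch of that idea follows. It uses plain collections in place of FlowFiles, and ConversionException, RoutedRecords, toInt, and route are hypothetical names introduced only for illustration; none of them are NiFi APIs. Records whose value cannot be narrowed to the schema's int type are collected for the failure destination rather than dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FailureRoutingSketch {

    /** Hypothetical stand-in for a type conversion failure; not the NiFi exception class. */
    public static class ConversionException extends RuntimeException {
        public ConversionException(String message) {
            super(message);
        }
    }

    /** Hypothetical holder with one list per destination (stand-ins for FlowFiles). */
    public static class RoutedRecords {
        public final List<Map<String, Object>> success = new ArrayList<>();
        public final List<Map<String, Object>> failure = new ArrayList<>();
    }

    /** Narrows a numeric value to int and fails when it is missing or out of range. */
    public static int toInt(Object value) {
        if (value instanceof Number) {
            long asLong = ((Number) value).longValue();
            if (asLong >= Integer.MIN_VALUE && asLong <= Integer.MAX_VALUE) {
                return (int) asLong;
            }
        }
        throw new ConversionException("Cannot convert " + value + " to int");
    }

    /** Sends each record to success or failure depending on whether the field converts. */
    public static RoutedRecords route(List<Map<String, Object>> records, String intField) {
        RoutedRecords routed = new RoutedRecords();
        for (Map<String, Object> record : records) {
            try {
                toInt(record.get(intField));
                routed.success.add(record);
            } catch (ConversionException e) {
                // The offending record is kept and routed to failure, not discarded.
                routed.failure.add(record);
            }
        }
        return routed;
    }
}
```

In a Processor, the two lists would correspond to FlowFiles transferred to the success and failure relationships.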
At this point, I do not see a need for changes to the
WriteJsonResult class. As mentioned in the discussion for NIFI-14696, other
Record Writers have the option to implement different strategies for the level
of leniency in schema handling, but the current JSON Record Writer looks like
it should retain existing behavior in general.
If there are other failure scenarios similar to the one described, those may
also be helpful to consider. In these cases, however, evaluating the larger
context of options is the way to go.
> WriteJsonResult leaves the ByteArrayOutputStream in inconsistent state after
> trying to write an incompatible schema record
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-14753
> URL: https://issues.apache.org/jira/browse/NIFI-14753
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 2.4.0
> Reporter: Dariusz Seweryn
> Priority: Minor
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> h1. Context
> Consider a WriteJsonResult for schema ["Int Field Name": INT].
> Writing a record with ["Int Field Name": Long.MAX_VALUE] throws an
> IllegalTypeConversionException, which is expected. Unfortunately, the
> ByteArrayOutputStream (BAOS) is then left with content that is not valid
> JSON:
> { "Int Field Name" }
> See [this
> discussion|https://github.com/apache/nifi/pull/10053#discussion_r2190950596].
> h1. IllegalTypeConversionException handling
> Apart from handling the offending record in the processor that tried to write
> it, this situation has the following consequence for exception handling:
> The same malformed record will most likely surface or cause a warning log in a
> following processor, which can be surprising without knowledge of this
> behavior in WriteJsonResult (unless all other processors assume that
> Malformed Records may appear in consumed FlowFiles and silently drop such
> records). {_}This can lead to unnecessary work for people triaging such
> warning logs{_}.
> h1. Ideal solution
> In case of an IllegalTypeConversionException, WriteJsonResult does not write
> any data to the BAOS.
> Such an approach would need to buffer all writes to the FlowFile until the
> full record is processed, executing them afterwards. This may impact performance.