[
https://issues.apache.org/jira/browse/NIFI-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854280#comment-15854280
]
Joe Skora commented on NIFI-3055:
---------------------------------
[~markap14], [~devriesb], [~mosermw], [~joewitt]
* Mark, I see your point regarding the license. I will replace that function
ASAP, even though its purpose is fundamentally different from the original
code's. JoeW, the fundamental purpose is different: it finds the subset of the
source String that fits, rather than computing the UTF length itself. The logic
is based on the Unicode specification, so it shouldn't be a problem to
implement a non-infringing check. I was concerned that the original didn't
consider 4-byte values anyway.
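For reference, the spec-based check amounts to summing the per-char widths of Java's "modified UTF-8" encoding, which is what {{writeUTF()}} uses. A minimal sketch (the class and method names here are hypothetical, not NiFi's actual code); note that supplementary code points arrive as surrogate pairs and each surrogate char encodes as 3 bytes, which is why a separate 4-byte case never appears at the {{char}} level:

```java
public class Utf8Length {
    // Byte length of s in Java's modified UTF-8, per the DataOutput spec:
    // U+0001..U+007F -> 1 byte; U+0000 and U+0080..U+07FF -> 2 bytes;
    // everything else (including each half of a surrogate pair) -> 3 bytes.
    public static int modifiedUtf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;
            } else if (c <= 0x07FF) {   // includes c == 0x0000
                bytes += 2;
            } else {
                bytes += 3;             // surrogates count 3 bytes each
            }
        }
        return bytes;
    }
}
```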
* Mike and I determined late last week that most uses of {{writeUTF()}} should
probably be replaced with {{writeLongString}}, because {{writeUTF()}} has a
hard limit of 65,535 bytes that will cause problems with internationalized
content. We haven't created a ticket for that yet, but it may not be necessary
if [NIFI-3389|https://issues.apache.org/jira/browse/NIFI-3389] accomplishes
that.
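The {{writeLongString}} idea is to use a 4-byte length prefix followed by the raw UTF-8 bytes instead of {{writeUTF()}}'s 2-byte length field. A minimal sketch under that assumption (the signature is assumed, not taken from NiFi's source):

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LongStringWriter {
    // Write a 4-byte int length prefix followed by the string's UTF-8
    // bytes, so the value is not constrained by writeUTF()'s 2-byte
    // (65,535) length field.
    public static void writeLongString(DataOutputStream out, String s)
            throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }
}
```

A 70,000-character ASCII string, which would make {{writeUTF()}} throw {{UTFDataFormatException}}, writes cleanly this way.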
* As for exception handling where {{writeUTF()}} can't be eliminated: adding a
length check before the {{writeUTF()}} call would add extra processing for
values that don't exceed the limit, because {{writeUTF()}} always scans the
entire original String to calculate its UTF length. The {{writeUTFLimited()}}
and {{getCharsInUTFLength()}} logic processes only as much of the original
String as fits within the limit, never the entire original String, so it is as
efficient as possible when the exception does occur. Without the fix, the
exception takes Writers offline and eventually requires an instance restart.
I don't think that is a widespread problem, so it seems better to perform the
extra processing only for values after the exception occurs.
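The early-exit property described above can be sketched as follows (hypothetical names, not the actual {{getCharsInUTFLength()}} implementation): walk the String accumulating modified-UTF-8 byte widths and stop the moment the budget is exhausted, so the full String is scanned only when it actually fits.

```java
public class UtfTruncate {
    // Number of leading chars of s whose modified-UTF-8 encoding fits in
    // maxBytes. Returns early once the byte budget is exhausted, so an
    // over-limit String is never scanned in full.
    public static int charsInUtfLength(String s, int maxBytes) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int width = (c >= 0x0001 && c <= 0x007F) ? 1
                      : (c <= 0x07FF ? 2 : 3);
            if (bytes + width > maxBytes) {
                return i;       // budget exhausted; stop scanning here
            }
            bytes += width;
        }
        return s.length();      // whole String fits
    }
}
```

A caller recovering from the exception could then retry with {{s.substring(0, charsInUtfLength(s, 65535))}}, truncating at a char boundary that is guaranteed to encode within the limit.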
> StandardRecordWriter can throw UTFDataFormatException
> -----------------------------------------------------
>
> Key: NIFI-3055
> URL: https://issues.apache.org/jira/browse/NIFI-3055
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.0.0, 0.7.1
> Reporter: Brandon DeVries
> Assignee: Joe Skora
> Priority: Blocker
> Fix For: 0.8.0, 1.2.0
>
>
> StandardRecordWriter.writeRecord()\[1] uses DataOutputStream.writeUTF()\[2]
> without checking the length of the value to be written. If this length is
> greater than 65535 (2^16 - 1), you get a UTFDataFormatException "encoded
> string too long..."\[3]. Ultimately, this can result in an
> IllegalStateException\[4], -bringing a halt to the data flow- causing
> PersistentProvenanceRepository "Unable to merge <prov_journal> with other
> Journal Files due to..." WARNings.
> Several of the field values being written in this way are pre-defined, and
> thus not likely an issue. However, the "details" field can be populated by a
> processor, and can be of an arbitrary length. -Additionally, if the details
> field is indexed (which I didn't investigate, but I'm sure is easy enough to
> determine), then the length might be subject to the Lucene limit discussed in
> NIFI-2787-.
> \[1]
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/StandardRecordWriter.java#L163-L173
> \[2]
> http://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF%28java.lang.String%29
> \[3]
> http://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> \[4]
> https://github.com/apache/nifi/blob/5fd4a55791da27fdba577636ac985a294618328a/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/PersistentProvenanceRepository.java#L754-L755
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)