[ 
https://issues.apache.org/jira/browse/NIFI-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854280#comment-15854280
 ] 

Joe Skora commented on NIFI-3055:
---------------------------------

[~markap14], [~devriesb], [~mosermw], [~joewitt]

* Mark, I see your point regarding the license.  I will replace that function 
ASAP, even though its purpose is fundamentally different from the original 
code's.  JoeW, the fundamental purpose differs: it finds the subset of the 
source String that fits within the limit rather than computing the UTF length 
itself, and the logic follows the ISO Unicode specification, so it shouldn't be 
a problem to implement a non-infringing check.  I was also concerned that the 
original didn't consider 4-byte values anyway.
* Mike and I determined late last week that most uses of {{writeUTF()}} should 
probably be replaced with {{writeLongString()}}, because {{writeUTF()}} has a 
hard limit of 65,535 encoded bytes that will cause problems with 
internationalized content.  We haven't created a ticket for that yet, but it 
may not be necessary if NIFI-3389 
(https://issues.apache.org/jira/browse/NIFI-3389) accomplishes that.
* As for exception handling where {{writeUTF()}} can't be eliminated: adding a 
length check before the {{writeUTF()}} call would impose extra processing even 
on values that don't exceed the limit, because {{writeUTF()}} always scans the 
entire original String to calculate its UTF length.  The {{writeUTFLimited()}} 
and {{getCharsInUTFLength()}} logic processes only as much of the original 
String as fits within the limit, never the entire String, so it is as 
efficient as possible when the exception does occur.  Without the fix, the 
exception takes Writers offline and eventually requires an instance restart.  
Since I don't think this is a widespread problem, it seems better to perform 
the extra processing only for values that have already triggered the 
exception.
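To illustrate the {{writeLongString()}} point above: the 64K ceiling comes from {{writeUTF()}}'s unsigned 2-byte length prefix, so a helper that writes a 4-byte length followed by UTF-8 bytes removes the limit.  A minimal sketch of that idea (illustrative only, not NiFi's actual implementation; the class and method names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LongStringDemo {
    // Sketch of the writeLongString idea: a 4-byte int length prefix
    // instead of writeUTF()'s 2-byte prefix, so there is no 65,535-byte
    // ceiling on the encoded String.
    static void writeLongString(DataOutputStream out, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);  // int length: no 64K limit
        out.write(bytes);
    }

    // Convenience wrapper returning the encoded bytes, for demonstration.
    static byte[] encode(String value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeLongString(new DataOutputStream(buf), value);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this scheme a 70,000-character ASCII value encodes to 70,004 bytes without error, whereas {{writeUTF()}} would throw UTFDataFormatException.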
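And to illustrate the {{getCharsInUTFLength()}} point: because {{writeUTF()}} encodes each char as 1, 2, or 3 bytes (per the modified-UTF-8 rules in the DataOutput spec), one can scan the String and stop the moment the accumulated byte count exceeds the limit, so the exception path never touches the rest of the String.  A minimal sketch of that scanning logic (assumed names, not the patch's actual code):

```java
public class UtfLimit {
    // Sketch of the getCharsInUTFLength() idea: return how many leading
    // chars of s fit within utfByteLimit bytes of modified UTF-8, scanning
    // no further than the first char that overflows the limit.
    static int charsInUtfLength(String s, int utfByteLimit) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;              // 1-byte form (ASCII except NUL)
            } else if (c <= 0x07FF) {
                bytes += 2;              // 2-byte form (includes \u0000)
            } else {
                bytes += 3;              // 3-byte form (surrogates included)
            }
            if (bytes > utfByteLimit) {
                return i;                // chars 0..i-1 fit; stop scanning
            }
        }
        return s.length();               // whole String fits
    }
}
```

The truncation point can then feed a {{writeUTFLimited()}}-style call that writes only the prefix that fits, rather than rescanning the full value.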

> StandardRecordWriter can throw UTFDataFormatException
> -----------------------------------------------------
>
>                 Key: NIFI-3055
>                 URL: https://issues.apache.org/jira/browse/NIFI-3055
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 0.7.1
>            Reporter: Brandon DeVries
>            Assignee: Joe Skora
>            Priority: Blocker
>             Fix For: 0.8.0, 1.2.0
>
>
> StandardRecordWriter.writeRecord()\[1] uses DataOutputStream.writeUTF()\[2] 
> without checking the length of the value to be written.  If this length is 
> greater than  65535 (2^16 - 1), you get a UTFDataFormatException "encoded 
> string too long..."\[3].  Ultimately, this can result in an 
> IllegalStateException\[4], -bringing a halt to the data flow- causing 
> PersistentProvenanceRepository "Unable to merge <prov_journal> with other 
> Journal Files due to..." WARNings.
> Several of the field values being written in this way are pre-defined, and 
> thus not likely an issue.  However, the "details" field can be populated by a 
> processor, and can be of an arbitrary length.  -Additionally, if the detail 
> field is indexed (which I didn't investigate, but I'm sure is easy enough to 
> determine), then the length might be subject to the Lucene limit discussed in 
> NIFI-2787-.
> \[1] 
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/StandardRecordWriter.java#L163-L173
> \[2] 
> http://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF%28java.lang.String%29
> \[3] 
> http://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> \[4] 
> https://github.com/apache/nifi/blob/5fd4a55791da27fdba577636ac985a294618328a/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/PersistentProvenanceRepository.java#L754-L755



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)