[
https://issues.apache.org/jira/browse/NIFI-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864991#comment-15864991
]
Joe Skora commented on NIFI-3055:
---------------------------------
[~markap14], Sorry, I missed that point too, thinking you were referring to
counting the UTF length beforehand. Looking at a pre-check of the original
{{String}} length, it will help if the {{String}} is longer than 65,535
{{chars}} (the arbitrary limit in {{DataOutputStream}}) and the first 65,535
{{chars}} all map to 1 byte UTF-8 values, but if any of the first 65,535
{{chars}} map to 2 or more UTF-8 bytes the exception will still occur and that
initial copy wasted as another copy is made based on the {{chars}} that fit
within the UTF-8. So it will expedite handling of {{Strings}} consisting made
up of 1 byte values such as ASCII text. In the long run, it's probably better
to change it to use {{.writeLongString()}} for all of these {{Strings}} similar
to [NIFI-3389](https://issues.apache.org/jira/browse/NIFI-3389) which will
eliminate these concerns altogether.
> StandardRecordWriter can throw UTFDataFormatException
> -----------------------------------------------------
>
> Key: NIFI-3055
> URL: https://issues.apache.org/jira/browse/NIFI-3055
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.0.0, 0.7.1
> Reporter: Brandon DeVries
> Assignee: Joe Skora
> Priority: Blocker
> Fix For: 0.8.0, 1.2.0
>
>
> StandardRecordWriter.writeRecord()\[1] uses DataOutputStream.writeUTF()\[2]
> without checking the length of the value to be written. If this length is
> greater than 65535 (2^16 - 1), you get a UTFDataFormatException "encoded
> string too long..."\[3]. Ultimately, this can result in an
> IllegalStateException\[4], -bringing a halt to the data flow- causing
> PersistentProvenanceRepository "Unable to merge <prov_journal> with other
> Journal Files due to..." WARNings.
> Several of the field values being written in this way are pre-defined, and
> thus not likely an issue. However, the "details" field can be populated by a
> processor, and can be of an arbitrary length. -Additionally, if the detail
> filed is indexed (which I didn't investigate, but I'm sure is easy enough to
> determine), then the length might be subject to the Lucene limit discussed in
> NIFI-2787-.
> \[1]
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/StandardRecordWriter.java#L163-L173
> \[2]
> http://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF%28java.lang.String%29
> \[3]
> http://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> \[4]
> https://github.com/apache/nifi/blob/5fd4a55791da27fdba577636ac985a294618328a/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/PersistentProvenanceRepository.java#L754-L755
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)