[
https://issues.apache.org/jira/browse/NIFI-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854100#comment-15854100
]
Mark Payne commented on NIFI-3055:
----------------------------------
[~devriesb] [~jskora] [~mosermw] - Sorry, I didn't have a chance to review this
PR before it got merged in. I do understand the problem and agree this can be
an issue. I'm a bit concerned with the PR that was merged to master, though,
for the following reasons:
* writeUTFLimited always attempts out.writeUTF first and catches the resulting
Exception. Under high volume, throwing an Exception here can be a significant
performance hit. We should first check whether the string length exceeds 65535
characters; if so, the encoded form is guaranteed to exceed writeUTF's
65535-byte limit, so we can go straight to the logic in the 'catch' block. If
the length is <= 65535 characters the encoded form may still be too long, so
the try/catch is still needed as-is, but we should avoid triggering the
Exception when we know in advance that it will be thrown.
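The pre-check might look roughly like the following. This is a minimal sketch, not the actual NiFi code: the method name mirrors the PR, but the truncation point and fallback behavior are illustrative assumptions. The key facts it relies on are documented ones: writeUTF fails when the *encoded* length exceeds 65535 bytes, every char encodes to at least 1 byte (so more than 65535 chars can never fit), and at most 3 bytes (so 21845 chars always fit).

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

final class WriteUtfSketch {

    // 65535 / 3: a string this short always fits, since each char encodes to
    // at most 3 bytes in modified UTF-8.
    private static final int SAFE_CHARS = 21845;

    static void writeUTFLimited(final DataOutputStream out, final String value) throws IOException {
        if (value.length() > 65535) {
            // Each char is at least 1 encoded byte, so this is guaranteed to
            // exceed the limit: truncate up front instead of paying for a
            // thrown-and-caught Exception on the hot path.
            out.writeUTF(value.substring(0, SAFE_CHARS));
            return;
        }
        try {
            // <= 65535 chars may still exceed 65535 encoded bytes if the
            // string contains multi-byte chars, so the try/catch remains.
            out.writeUTF(value);
        } catch (final UTFDataFormatException e) {
            out.writeUTF(value.substring(0, SAFE_CHARS));
        }
    }
}
```

Note that DataOutputStream.writeUTF computes the encoded length before writing any bytes, so when it throws, the stream is still clean and the fallback write produces a valid record.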
* If we do truncate a value, we should include in the log message which field
is getting truncated so that it can be troubleshot (is that a word?
troubleshooted?)
* This solution should work okay for the 'Details' field of the Provenance
Event, as noted in the JIRA description. However, this PR modifies both the
StandardRecordWriter and the SchemaRecordWriter, and the SchemaRecordWriter is
used in several other places - for instance, the WriteAheadFlowFileRepository
and the FileSystemSwapManager, among others. Silently truncating some of the
fields there may not be okay, and in those cases it may make more sense to
throw the Exception. As a result, I would recommend we update the RecordField
object to add a 'canTruncate' method, or something like that, to indicate
whether or not the field can be truncated; if it returns false, then let the
UTFDataFormatException fly. Another possible solution is to update the
"Details" field of the schema to be of type LONG_STRING instead of STRING.
Generally that is frowned upon, because the whole point of the
SchemaRecordWriter is to allow users to roll back from version X of NiFi to
version X-1, and that won't work if the Field Type of a schema is changed.
However, I believe NIFI-3389 will already make this true, and I don't know of
a good way to avoid it, so for 1.2.0 it may make sense to capitalize on that.
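The 'canTruncate' idea might be sketched roughly as follows. This is a hypothetical illustration only: RecordField here is a simplified stand-in, not NiFi's actual class, and the writer, field names, and truncation point are all made up for the example.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Simplified stand-in for the real RecordField; the constructor flag is
// illustrative only.
final class RecordField {
    private final String fieldName;
    private final boolean truncateAllowed;

    RecordField(final String fieldName, final boolean truncateAllowed) {
        this.fieldName = fieldName;
        this.truncateAllowed = truncateAllowed;
    }

    String getFieldName() {
        return fieldName;
    }

    // The proposed addition: indicates whether this field may be silently
    // shortened, or whether an over-long value should fail the write.
    boolean canTruncate() {
        return truncateAllowed;
    }
}

final class FieldAwareWriter {
    // 65535 / 3: a string this short always fits, since each char encodes to
    // at most 3 bytes in modified UTF-8.
    private static final int SAFE_CHARS = 21845;

    static void writeField(final DataOutputStream out, final RecordField field,
                           final String value) throws IOException {
        try {
            out.writeUTF(value);
        } catch (final UTFDataFormatException e) {
            if (!field.canTruncate()) {
                // e.g. FlowFile repository or swap file fields: let it fly.
                throw e;
            }
            // Name the field so the truncation can be troubleshot from the log.
            System.err.println("Truncating overly long value of field '"
                    + field.getFieldName() + "'");
            out.writeUTF(value.substring(0, SAFE_CHARS));
        }
    }
}
```

With this shape, a writer like the SchemaRecordWriter could truncate a 'Details' field but still fail loudly for fields where silent data loss would corrupt the repository.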
* The getCharsInUTFLength method has a comment - 'see
java.io.DataOutputStream.writeUTF()' - and it appears that the next 20 lines
or so of code come from Oracle/OpenJDK's DataOutputStream class. That code is
NOT appropriately licensed for inclusion in ASF projects. Please correct me if
I am wrong here.
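If that concern holds, one way forward is a clean-room implementation written from the publicly documented modified UTF-8 format rather than from JDK source: chars u+0001 through u+007F encode to 1 byte, u+0000 and u+0080 through u+07FF to 2 bytes, and everything else to 3 bytes. A sketch of what that might look like (the class and method names are illustrative, not the PR's):

```java
// Computes the length, in bytes, of a String in the modified UTF-8 form
// written by DataOutputStream.writeUTF(), per the documented encoding rules.
final class Utf8Length {
    static long encodedLength(final String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;          // plain ASCII, excluding NUL
            } else if (c <= 0x07FF) {
                bytes += 2;          // includes u+0000, which encodes as 2 bytes
            } else {
                bytes += 3;          // everything else, including surrogates
            }
        }
        return bytes;
    }
}
```

A string can then be checked against the 65535-byte limit before writeUTF is ever called, with no JDK-derived code involved.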
The last bullet point above is my biggest concern, though. I'm going to
re-open the JIRA and mark it as a blocker for now, as we cannot release the
code with this in it if it does in fact come from a license that is not
ASF-friendly. Again, if I'm wrong on this point, please let me know.
Thanks!
-Mark
> StandardRecordWriter can throw UTFDataFormatException
> -----------------------------------------------------
>
> Key: NIFI-3055
> URL: https://issues.apache.org/jira/browse/NIFI-3055
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.0.0, 0.7.1
> Reporter: Brandon DeVries
> Assignee: Joe Skora
> Fix For: 0.8.0, 1.2.0
>
>
> StandardRecordWriter.writeRecord()\[1] uses DataOutputStream.writeUTF()\[2]
> without checking the length of the value to be written. If this length is
> greater than 65535 (2^16 - 1), you get a UTFDataFormatException "encoded
> string too long..."\[3]. Ultimately, this can result in an
> IllegalStateException\[4], -bringing a halt to the data flow- causing
> PersistentProvenanceRepository "Unable to merge <prov_journal> with other
> Journal Files due to..." WARNings.
> Several of the field values being written in this way are pre-defined, and
> thus not likely an issue. However, the "details" field can be populated by a
> processor, and can be of an arbitrary length. -Additionally, if the detail
> field is indexed (which I didn't investigate, but I'm sure is easy enough to
> determine), then the length might be subject to the Lucene limit discussed in
> NIFI-2787-.
> \[1]
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/StandardRecordWriter.java#L163-L173
> \[2]
> http://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF%28java.lang.String%29
> \[3]
> http://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> \[4]
> https://github.com/apache/nifi/blob/5fd4a55791da27fdba577636ac985a294618328a/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/PersistentProvenanceRepository.java#L754-L755
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)