[ https://issues.apache.org/jira/browse/NIFI-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854100#comment-15854100 ]

Mark Payne commented on NIFI-3055:
----------------------------------

[~devriesb] [~jskora] [~mosermw] - Sorry, I didn't have a chance to review this 
PR before it got merged in. I do understand the problem and agree this can be 
an issue. I'm a bit concerned with the PR that was merged to master, though, 
for the following reasons:

* writeUTFLimited always attempts to call out.writeUTF first and catches an 
Exception. Under high volume, throwing an Exception here can be a significant 
hit to performance. It should probably first check whether the string length is 
greater than 65535; if so, we know that the encoded form will exceed the limit, 
so we can go straight to the logic in the 'catch' block. If the length is 
<= 65535 then overflow is still possible (multi-byte characters), so we would 
still need the try/catch to continue the way that it is, but we should avoid 
the Exception whenever we know that we will trigger it.
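The pre-check described above could look something like this rough sketch. Note that writeUTFLimited and the truncation fallback here are illustrative stand-ins, not the PR's actual code:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class WriteUtfLimitedSketch {
    // writeUTF() fails when the encoded form exceeds 65535 bytes.
    private static final int MAX_UTF_BYTES = 65535;

    static void writeUTFLimited(DataOutputStream out, String value) throws IOException {
        // A String of more than 65535 chars can never fit, since each char
        // encodes to at least one byte; skip the doomed writeUTF call entirely.
        if (value.length() > MAX_UTF_BYTES) {
            writeTruncated(out, value);
            return;
        }
        // Shorter strings may still exceed the byte limit if they contain
        // multi-byte characters, so the try/catch remains as the fallback.
        try {
            out.writeUTF(value);
        } catch (final UTFDataFormatException e) {
            writeTruncated(out, value);
        }
    }

    private static void writeTruncated(DataOutputStream out, String value) throws IOException {
        // Naive truncation for illustration: each char encodes to at most
        // 3 bytes, so MAX_UTF_BYTES / 3 chars is guaranteed to fit.
        final String truncated = value.substring(0, Math.min(value.length(), MAX_UTF_BYTES / 3));
        out.writeUTF(truncated);
    }
}
```

With this shape, the common case (a short string) pays no exception cost at all, and the guaranteed-overflow case pays none either; only the rare borderline case can still hit the catch block.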

* If we do truncate a value, we should include in the log message which field 
is getting truncated so that it can be troubleshot (is that a word? 
troubleshooted?)

* This solution should work okay for the 'Details' field of the Provenance 
Event, as noted in the JIRA description. However, this PR modifies both the 
StandardRecordWriter and the SchemaRecordWriter. The SchemaRecordWriter, 
though, is used in several places - for instance, the 
WriteAheadFlowFileRepository and the FileSystemSwapManager, among others. 
Silently truncating some fields there may not be okay, and in those cases it 
may make more sense to throw the Exception. As a result, I would recommend we 
update the RecordField object to add a 'canTruncate' method, or something like 
that, to indicate whether or not the field can be truncated. If this returns 
false, then let the UTFDataFormatException fly. Another possible solution is to 
just update the "Details" field of the schema to be of type LONG_STRING instead 
of STRING. Generally, changing a Field Type is frowned upon, because the whole 
point of the SchemaRecordWriter is to allow users to roll back from version X 
of NiFi to version X-1, and that won't work if the Field Type of a schema is 
changed. However, I believe NIFI-3389 will already make this true, and I don't 
know of a good way to avoid it. So for 1.2.0 it may make sense to capitalize on 
this.
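As a sketch of what that hypothetical 'canTruncate' flag might look like (the class and method names below are made up for illustration, not NiFi's actual RecordField API):

```java
import java.io.IOException;
import java.io.UTFDataFormatException;

public class TruncatableFieldSketch {
    // Illustrative stand-in for NiFi's RecordField, extended with the
    // proposed truncation flag.
    static final class Field {
        final String name;
        final boolean truncatable;

        Field(String name, boolean truncatable) {
            this.name = name;
            this.truncatable = truncatable;
        }

        boolean canTruncate() {
            return truncatable;
        }
    }

    // Decide how the writer should react to an over-long value: truncate
    // it for fields like 'Details', or let the exception fly otherwise.
    static String enforceLimit(Field field, String value) throws IOException {
        final int maxChars = 65535;
        if (value.length() <= maxChars) {
            return value;
        }
        if (!field.canTruncate()) {
            throw new UTFDataFormatException(
                "Value for field '" + field.name + "' is too long to encode");
        }
        return value.substring(0, maxChars);
    }
}
```

The point being that the Provenance Event schema would mark 'Details' as truncatable, while the FlowFile repository and swap-file schemas would leave their fields non-truncatable and fail loudly.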

* The getCharsInUTFLength method has a comment, "see 
java.io.DataOutputStream.writeUTF()", and it appears that the next 20 or so 
lines of code come from Oracle/OpenJDK's DataOutputStream class. This code is 
NOT appropriately licensed for inclusion in ASF projects. Please correct me if 
I am wrong here.
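For what it's worth, the length check doesn't need JDK-derived code at all: the "modified UTF-8" format that writeUTF() uses is documented in the java.io.DataInput javadoc, so the encoded length can be computed independently from that spec. A sketch (not NiFi's actual getCharsInUTFLength):

```java
public class ModifiedUtf8 {
    /**
     * Returns the number of bytes the given string occupies in the
     * "modified UTF-8" form used by DataOutputStream.writeUTF(), per the
     * format documented in the java.io.DataInput javadoc:
     *   \u0001 to \u007F            -> 1 byte
     *   \u0000, \u0080 to \u07FF    -> 2 bytes
     *   \u0800 to \uFFFF            -> 3 bytes
     */
    static int encodedLength(final String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;
            } else if (c <= 0x07FF) {
                bytes += 2;
            } else {
                bytes += 3;
            }
        }
        return bytes;
    }
}
```

Writing it from the published format description, rather than from the JDK source, would sidestep the licensing question entirely.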

The last bullet point above is my biggest concern here, though. I'm going to 
re-open the JIRA and mark it as a blocker for now, as we cannot release the 
code with this in it if it does in fact come from a source whose license is not 
ASF-friendly. Again, if I'm wrong on this point, please let me know.

Thanks!
-Mark


> StandardRecordWriter can throw UTFDataFormatException
> -----------------------------------------------------
>
>                 Key: NIFI-3055
>                 URL: https://issues.apache.org/jira/browse/NIFI-3055
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 0.7.1
>            Reporter: Brandon DeVries
>            Assignee: Joe Skora
>             Fix For: 0.8.0, 1.2.0
>
>
> StandardRecordWriter.writeRecord() [1] uses DataOutputStream.writeUTF() [2] 
> without checking the length of the value to be written.  If this length is 
> greater than 65535 (2^16 - 1), you get a UTFDataFormatException "encoded 
> string too long..." [3].  Ultimately, this can result in an 
> IllegalStateException [4], -bringing a halt to the data flow- causing 
> PersistentProvenanceRepository "Unable to merge <prov_journal> with other 
> Journal Files due to..." WARNings.
> Several of the field values being written in this way are pre-defined, and 
> thus not likely an issue.  However, the "details" field can be populated by a 
> processor, and can be of an arbitrary length.  -Additionally, if the details 
> field is indexed (which I didn't investigate, but I'm sure is easy enough to 
> determine), then the length might be subject to the Lucene limit discussed in 
> NIFI-2787-.
> [1] 
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/StandardRecordWriter.java#L163-L173
> [2] 
> http://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF%28java.lang.String%29
> [3] 
> http://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> [4] 
> https://github.com/apache/nifi/blob/5fd4a55791da27fdba577636ac985a294618328a/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/PersistentProvenanceRepository.java#L754-L755



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)