ijokarumawak commented on a change in pull request #3573: NIFI-6419: Fixed 
AvroWriter single record with external schema result…
URL: https://github.com/apache/nifi/pull/3573#discussion_r302336615
 
 

 ##########
 File path: 
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/avro/WriteAvroResultWithExternalSchema.java
 ##########
 @@ -78,14 +78,19 @@ protected void onBeginRecordSet() throws IOException {
     @Override
     public Map<String, String> writeRecord(final Record record) throws 
IOException {
         // If we are not writing an active record set, then we need to ensure 
that we write the
-        // schema information.
+        // schema information at the beginning and call flush at the end.
         if (!isActiveRecordSet()) {
             flush();
             schemaAccessWriter.writeHeader(recordSchema, getOutputStream());
         }
 
         final GenericRecord rec = AvroTypeUtil.createAvroRecord(record, 
avroSchema);
         datumWriter.write(rec, encoder);
+
+        if (!isActiveRecordSet()) {
+            flush();
 
 Review comment:
   @turcsanyip I see, you're right, `flush()` is called indirectly from 
`onFinishRecordSet()`.
   
   I don't think calling `flush()` from client code is a bad practice. Rather, 
I think it's a good practice as we developers don't know if each writer 
implementation actually flushes buffered data when its close is called. You may 
find these Stackoverflow threads interesting. Most people expect `flush()` is 
called automatically when they call `close()`, but relying on that assumption 
can cause issues like this, NIFI-6419.
   
https://stackoverflow.com/questions/2732260/in-java-when-i-call-outputstream-close-do-i-always-need-to-call-outputstream
   
   In that sense, not calling `flush()` within 
`WriteAvroResultWithExternalSchema.close()` is correct. There's no contract 
that every close implementation should call flush first. However, since the 
RecordSetWriter/RecordWriter are used many places and we don't want to check 
each of them to confirm whether `flush()` is called properly, I think calling 
`flush()` at WriteAvroResultWithExternalSchema is a safer solution in this 
case. That's what both of you proposed.
   
   We've been discussing where to add the `flush()` invocation. While I think 
both proposal make sense, I still prefer adding `flush()` at `close()`. 
Because, in order to fully utilize underlying buffer mechanism, writer 
implementation shouldn't call `flush()` too often. In this particular case, 
existing `writeRecord(Record record)` already calls `flush()`, so it may not be 
that important, though..
   
   Let's consider following possible usage. A developer wants to use 
RecordWriter to write a custom file format which can have multiple Avro records 
separated by special character followed by some additional information. Then 
pseudo code would be:
   ```
   try (output = some_file, writer = createWriter(output)) {
       while (hasMoreRecord()) {
           writer.write(record)
           output.write(additionalInfo)
       }
   }
   ```
   
   If we add `flush()` at `write()`, flush will be called at every iteration, 
and may result a unoptimized performance.
   
   @pcgrenier already gave a thumbs-up above. If you agree by addressing 
NIFI-6419 by commits both of you in order to add flush at close, and helpful 
unit tests as I mentioned at the previous comment, I can merge them to master 
branch.
   @turcsanyip How do you think? Is there any obvious reason we should add 
flush at `writeRecord()` instead?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to