ijokarumawak commented on a change in pull request #3573: NIFI-6419: Fixed
AvroWriter single record with external schema result…
URL: https://github.com/apache/nifi/pull/3573#discussion_r302336615
##########
File path:
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/avro/WriteAvroResultWithExternalSchema.java
##########
@@ -78,14 +78,19 @@ protected void onBeginRecordSet() throws IOException {
@Override
public Map<String, String> writeRecord(final Record record) throws
IOException {
// If we are not writing an active record set, then we need to ensure
that we write the
- // schema information.
+ // schema information at the beginning and call flush at the end.
if (!isActiveRecordSet()) {
flush();
schemaAccessWriter.writeHeader(recordSchema, getOutputStream());
}
final GenericRecord rec = AvroTypeUtil.createAvroRecord(record,
avroSchema);
datumWriter.write(rec, encoder);
+
+ if (!isActiveRecordSet()) {
+ flush();
Review comment:
@turcsanyip I see, you're right, `flush()` is called indirectly from
`onFinishRecordSet()`.
I don't think calling `flush()` from client code is a bad practice. Rather,
I think it's a good practice as we developers don't know if each writer
implementation actually flushes buffered data when its close is called. You may
find these Stackoverflow threads interesting. Most people expect `flush()` is
called automatically when they call `close()`, but relying on that assumption
can cause issues like this, NIFI-6419.
https://stackoverflow.com/questions/2732260/in-java-when-i-call-outputstream-close-do-i-always-need-to-call-outputstream
In that sense, not calling `flush()` within
`WriteAvroResultWithExternalSchema.close()` is correct. There's no contract
that every close implementation should call flush first. However, since the
RecordSetWriter/RecordWriter are used many places and we don't want to check
each of them to confirm whether `flush()` is called properly, I think calling
`flush()` at WriteAvroResultWithExternalSchema is a safer solution in this
case. That's what both of you proposed.
We've been discussing where to add the `flush()` invocation. While I think
both proposal make sense, I still prefer adding `flush()` at `close()`.
Because, in order to fully utilize underlying buffer mechanism, writer
implementation shouldn't call `flush()` too often. In this particular case,
existing `writeRecord(Record record)` already calls `flush()`, so it may not be
that important, though..
Let's consider following possible usage. A developer wants to use
RecordWriter to write a custom file format which can have multiple Avro records
separated by special character followed by some additional information. Then
pseudo code would be:
```
try (output = some_file, writer = createWriter(output)) {
while (hasMoreRecord()) {
writer.write(record)
output.write(additionalInfo)
}
}
```
If we add `flush()` at `write()`, flush will be called at every iteration,
and may result a unoptimized performance.
@pcgrenier already gave a thumbs-up above. If you agree by addressing
NIFI-6419 by commits both of you in order to add flush at close, and helpful
unit tests as I mentioned at the previous comment, I can merge them to master
branch.
@turcsanyip How do you think? Is there any obvious reason we should add
flush at `writeRecord()` instead?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services