Gabor Szadovszky created AVRO-2109:
--------------------------------------
Summary: Reset buffers in case of IOException
Key: AVRO-2109
URL: https://issues.apache.org/jira/browse/AVRO-2109
Project: Avro
Issue Type: Improvement
Components: java
Affects Versions: 1.8.2
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky
If an {{IOException}} is thrown from {{DataFileWriter.writeBlock}}, the
{{buffer}} and {{blockCount}} are not reset, so duplicated data is written
out on {{close}}/{{flush}}.
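To make the failure mode concrete, here is a minimal sketch in Java. It is not Avro's actual {{DataFileWriter}} code: {{BufferedBlockWriterSketch}}, {{FailOnceOutputStream}} and {{demo}} are hypothetical names, and the model assumes the exception is raised after the block's bytes have already reached the underlying stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical sink: accepts every byte, but fails on the first flush().
class FailOnceOutputStream extends OutputStream {
    final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    private boolean failNext = true;

    @Override public void write(int b) { sink.write(b); }

    @Override public void flush() throws IOException {
        if (failNext) {
            failNext = false;
            throw new IOException("simulated interrupt during flush");
        }
    }
}

// Hypothetical, simplified stand-in for a block-buffering writer.
class BufferedBlockWriterSketch {
    private final OutputStream out;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int blockCount = 0;

    BufferedBlockWriterSketch(OutputStream out) { this.out = out; }

    void append(byte record) { buffer.write(record); blockCount++; }

    // State is reset only on the success path; an exception skips the reset.
    void writeBlock() throws IOException {
        if (blockCount > 0) {
            out.write(buffer.toByteArray());
            out.flush();      // may throw after the bytes were written
            buffer.reset();
            blockCount = 0;
        }
    }

    void close() throws IOException { writeBlock(); out.close(); }

    // Returns the number of bytes that reached the sink.
    static int demo() {
        FailOnceOutputStream out = new FailOnceOutputStream();
        BufferedBlockWriterSketch w = new BufferedBlockWriterSketch(out);
        w.append((byte) 1);
        w.append((byte) 2);
        try {
            w.writeBlock();   // bytes reach the sink, then flush() throws
        } catch (IOException e) {
            // the caller reacts to the failure by closing the writer
        }
        try {
            w.close();        // buffer was never reset, so the block is written again
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.sink.size();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4: the 2-byte block was written twice
    }
}
```

In this model two records are appended, yet four bytes end up in the sink, because {{close}} replays the block that was still held in the buffer.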
Whether we should reset the buffer after an exception is really a conceptual
question. If an exception occurs while writing the file, we should expect the
file to be corrupt, so the possible duplication of data should not matter.
On the other hand, if the file is already corrupt, why would we try to write
anything again at file close?
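One way to express the "reset" option is to clear the state in a {{finally}} block, so that a later {{close}} has nothing left to replay. The sketch below is an illustration under that assumption, not a proposed patch: {{ResettingBlockWriterSketch}} and {{bytesWrittenAfterFailedFlush}} are hypothetical names, and the class is a simplified model rather than Avro's {{DataFileWriter}}.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical, simplified writer that always discards the buffered block,
// whether or not writing it succeeded.
class ResettingBlockWriterSketch {
    private final OutputStream out;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int blockCount = 0;

    ResettingBlockWriterSketch(OutputStream out) { this.out = out; }

    void append(byte record) { buffer.write(record); blockCount++; }

    void writeBlock() throws IOException {
        if (blockCount == 0) {
            return;
        }
        try {
            out.write(buffer.toByteArray());
            out.flush();
        } finally {
            // Reset even when an IOException propagates to the caller.
            buffer.reset();
            blockCount = 0;
        }
    }

    void close() throws IOException { writeBlock(); out.close(); }

    // Returns the number of bytes that reached the sink.
    static int bytesWrittenAfterFailedFlush() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Hypothetical stream whose first flush() fails after the bytes were accepted.
        OutputStream failing = new OutputStream() {
            private boolean failNext = true;

            @Override public void write(int b) { sink.write(b); }

            @Override public void flush() throws IOException {
                if (failNext) {
                    failNext = false;
                    throw new IOException("simulated timeout");
                }
            }
        };
        ResettingBlockWriterSketch w = new ResettingBlockWriterSketch(failing);
        w.append((byte) 1);
        w.append((byte) 2);
        try {
            w.writeBlock();   // throws, but the buffer is cleared anyway
        } catch (IOException e) {
            // the caller reacts to the failure by closing the writer
        }
        try {
            w.close();        // nothing is buffered, so nothing is rewritten
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        System.out.println(bytesWrittenAfterFailedFlush()); // prints 2
    }
}
```

The trade-off in this sketch is that if the exception fires before the bytes reach the stream, the buffered records are dropped. That matches the expectation that the file is corrupt after a failed write anyway.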
This issue comes from a Flume issue where the HDFS wait thread is interrupted
because of a timeout while writing an Avro file. The current block has already
been written properly, but because of the {{IOException}} caused by the thread
interrupt we invoke {{close()}} on the writer, which writes the block again
along with some other data (possibly a duplicated sync marker), and that
corrupts the file.
[~busbey], [~nkollar], [~zi], any thoughts?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)