Hi,

I am using AbstractFileOutputOperator in my application for writing
incoming tuples into a file on HDFS.

To handle failover scenarios, I am using fileOutputOperator.setMaxLength()
to roll over the files after a specified length. My assumption is that
rolled-over files recover faster from a failure, since recovery only needs
to deal with the last part of the file and not the entire file.

The use case does not dictate any specific value for maxLength, so I would
prefer the rolled-over file sizes to be equal to the HDFS block size
(say 64 MB).
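
For illustration, this is roughly the setup I have in mind (a minimal
sketch; MyFileOutputOperator is just a placeholder for my concrete
AbstractFileOutputOperator subclass, and the path is made up):

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class FileOutputApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // MyFileOutputOperator stands in for the concrete subclass of AbstractFileOutputOperator.
    MyFileOutputOperator writer = dag.addOperator("fileWriter", new MyFileOutputOperator());
    writer.setFilePath("/user/yogi/output");    // base output directory on HDFS (illustrative)
    writer.setMaxLength(64L * 1024 * 1024);     // roll over at the HDFS block size (64 MB)
  }
}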

With the current implementation of AbstractFileOutputOperator, the actual
size of each rolled-over file ends up slightly greater than 64 MB. This is
because the file is rolled over only after the incoming tuple has been
written to it; the size check for roll-over happens after the write.
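
To make the overshoot concrete, here is a tiny self-contained simulation
of that write-then-check ordering (purely illustrative; the 1 KB tuple
size and the counter are made up, and this is not the Malhar code itself):

public class WriteThenCheckDemo
{
  public static void main(String[] args)
  {
    final long maxLength = 64L * 1024 * 1024;  // desired part size: one HDFS block (64 MB)
    final int tupleSize = 1024;                // assume every tuple serializes to 1 KB
    long partLength = 0;

    while (true) {
      partLength += tupleSize;                 // 1. the tuple is written first
      if (partLength > maxLength) {            // 2. the size check happens only afterwards,
        break;                                 //    so the part is rolled once it is already too big
      }
    }
    System.out.println("rolled part size = " + partLength);  // 67109888 bytes, 1 KB over the 64 MB block
  }
}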

I believe a file slightly greater than 64 MB spans two HDFS blocks and
therefore costs two block entries on the NameNode. This could be avoided
by flipping the sequence: check whether adding the incoming tuple would
exceed the size limit, and roll over to a new file *before* writing the
tuple.
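
As a rough sketch of the flipped ordering I have in mind (illustrative
only; names such as rotate() and currentPartLength are stand-ins, not the
actual operator members):

import java.io.IOException;

// Check and rotate BEFORE writing, so no rolled part ever exceeds maxLength.
public abstract class CheckThenWriteSketch
{
  protected long maxLength = 64L * 1024 * 1024;  // e.g. one HDFS block
  protected long currentPartLength;

  public void writeTuple(byte[] tupleBytes) throws IOException
  {
    // Roll over first if this tuple would push the current part past maxLength.
    if (currentPartLength + tupleBytes.length > maxLength) {
      rotate();
      currentPartLength = 0;
    }
    writeToCurrentPart(tupleBytes);              // only then write the tuple
    currentPartLength += tupleBytes.length;
  }

  protected abstract void rotate() throws IOException;                      // close .partN, open .part(N+1)
  protected abstract void writeToCurrentPart(byte[] bytes) throws IOException;
}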

Do you think this improvement should be considered? If yes, I will create
a JIRA and work on it.

Also, would this change break backward compatibility? The API signature
remains the same, but there is a slight change in semantics, so I wanted
to get feedback from the community.

~ Yogi
