Guess we don't need to worry about the case when the tuple size itself is larger than the HDFS block size :-)
Ram

On Fri, Dec 11, 2015 at 12:37 AM, Yogi Devendra <[email protected]> wrote:

> Hi,
>
> I am using AbstractFileOutputOperator in my application for writing
> incoming tuples into a file on HDFS.
>
> Considering that there could be failover scenarios, I am using
> fileOutputOperator.setMaxLength() to roll the files over after a
> specified length. The assumption is that rolled-over files recover
> faster from a failure, since recovery only covers the last part of
> the file and not the entire file.
>
> There is no use-case-specific recommended value for maxLength, so I
> would prefer the rolled-over file sizes to be equal to the HDFS block
> size (say 64 MB).
>
> With the current implementation of AbstractFileOutputOperator, the
> actual size of a rolled-over file ends up slightly greater than 64 MB.
> This is because the file is rolled over after the incoming tuple has
> been written to it: the check for file size (for rollover) happens
> after the tuple is written.
>
> I believe that a file slightly greater than 64 MB would result in 2
> block entries on the NameNode. This can be avoided if we flip the
> sequence: check the file size (accounting for the incoming tuple) and
> roll over to a new file *before* writing the incoming tuple.
>
> Do you think this improvement should be considered? If yes, I will
> create a JIRA and work on it.
>
> Also, does this code change break backward compatibility? Although the
> signature of the API remains the same, there is a slight change in the
> semantics, so I wanted to get feedback from the community.
>
> ~ Yogi
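
For illustration, here is a minimal, self-contained sketch of the proposed "check before write" ordering. It is not the actual AbstractFileOutputOperator code; names such as maxLength, currentLength, rotate(), and writeBytes() are placeholders assumed for this example.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the proposed ordering: rotate the part file *before* writing
    // a tuple that would push it past maxLength, so no part exceeds maxLength.
    public class RolloverSketch {

      private final long maxLength;
      private long currentLength = 0;
      private int partNumber = 0;
      private final List<String> closedParts = new ArrayList<>();

      public RolloverSketch(long maxLength) {
        this.maxLength = maxLength;
      }

      public void processTuple(String tuple) {
        byte[] bytes = tuple.getBytes(StandardCharsets.UTF_8);
        // Proposed change: size check happens before the write, not after.
        if (currentLength > 0 && currentLength + bytes.length > maxLength) {
          rotate();            // close the current part at <= maxLength
        }
        writeBytes(bytes);     // write to the (possibly new) part file
      }

      private void rotate() {
        closedParts.add("part-" + partNumber + " (" + currentLength + " bytes)");
        partNumber++;
        currentLength = 0;
      }

      private void writeBytes(byte[] bytes) {
        // In the real operator this would append to an HDFS output stream.
        currentLength += bytes.length;
      }

      public static void main(String[] args) {
        RolloverSketch sketch = new RolloverSketch(64);  // tiny "block size" for the demo
        for (int i = 0; i < 10; i++) {
          sketch.processTuple("tuple-" + i + "-padding-padding");
        }
        sketch.closedParts.forEach(System.out::println); // every closed part is <= 64 bytes
      }
    }

The trade-off of this ordering is that a closed part may come in slightly under maxLength (by up to one tuple's worth of bytes), whereas the current write-then-check behavior can overshoot maxLength by up to one tuple, which is what produces the extra block entry on the NameNode.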
