Guess we don't need to worry about the case when the tuple size itself is larger than the HDFS block size :-)
Ram

On Fri, Dec 11, 2015 at 12:37 AM, Yogi Devendra <[email protected]> wrote:

> Hi,
>
> I am using AbstractFileOutputOperator in my application for writing
> incoming tuples into a file on HDFS.
>
> Considering that there could be failover scenarios, I am using
> fileOutputOperator.setMaxLength() to roll the files over after a
> specified length. The assumption is that rolled-over files recover
> faster from a failure, since recovery only covers the last part of
> the file and not the entire file.
>
> There is no use-case-specific recommended value for maxLength, so I
> would prefer the rolled-over file sizes to be equal to the HDFS block
> size (say 64 MB).
>
> With the current implementation of AbstractFileOutputOperator, the
> actual size of a rolled-over file ends up slightly greater than 64 MB.
> This is because the file is rolled over after the incoming tuple has
> been written to it: the check for file size (for rollover) happens
> after the tuple is written.
>
> I believe that a file slightly greater than 64 MB would result in 2
> block entries on the NameNode. This can be avoided if we flip the
> sequence: check the file size (accounting for the incoming tuple) and
> roll over to a new file *before* writing the incoming tuple.
>
> Do you think this improvement should be considered? If yes, I will
> create a JIRA and work on it.
>
> Also, does this code change break backward compatibility? Although the
> signature of the API remains the same, there is a slight change in the
> semantics, so I wanted to get feedback from the community.
>
> ~ Yogi
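
For illustration, here is a minimal, self-contained sketch of the proposed "check before write" ordering. It is not the actual AbstractFileOutputOperator code; names such as maxLength, currentLength, rotate(), and writeBytes() are placeholders assumed for this example.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the proposed ordering: rotate the part file *before* writing
    // a tuple that would push it past maxLength, so no part exceeds maxLength.
    public class RolloverSketch {

      private final long maxLength;
      private long currentLength = 0;
      private int partNumber = 0;
      private final List<String> closedParts = new ArrayList<>();

      public RolloverSketch(long maxLength) {
        this.maxLength = maxLength;
      }

      public void processTuple(String tuple) {
        byte[] bytes = tuple.getBytes(StandardCharsets.UTF_8);
        // Proposed change: size check happens before the write, not after.
        if (currentLength > 0 && currentLength + bytes.length > maxLength) {
          rotate();            // close the current part at <= maxLength
        }
        writeBytes(bytes);     // write to the (possibly new) part file
      }

      private void rotate() {
        closedParts.add("part-" + partNumber + " (" + currentLength + " bytes)");
        partNumber++;
        currentLength = 0;
      }

      private void writeBytes(byte[] bytes) {
        // In the real operator this would append to an HDFS output stream.
        currentLength += bytes.length;
      }

      public static void main(String[] args) {
        RolloverSketch sketch = new RolloverSketch(64);  // tiny "block size" for the demo
        for (int i = 0; i < 10; i++) {
          sketch.processTuple("tuple-" + i + "-padding-padding");
        }
        sketch.closedParts.forEach(System.out::println); // every closed part is <= 64 bytes
      }
    }

The trade-off of this ordering is that a closed part may come in slightly under maxLength (by up to one tuple's worth of bytes), whereas the current write-then-check behavior can overshoot maxLength by up to one tuple, which is what produces the extra block entry on the NameNode.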
