Hello, I am working on a project where we are integrating Samza and Hive. As part of this project, we ran into an issue where sequence files written from Samza were taking a long time (hours) to become fully synced to HDFS.
After some Googling and digging into the code, it appears that the issue is here:

https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111

Writer.stream(dfs.create(path)) implies that the caller of dfs.create(path) is responsible for explicitly closing the created stream. That never happens, and SequenceFileHdfsWriter's call to close() will only flush the stream, not close it. I believe the correct line should be:

Writer.file(path)

Or, alternatively, SequenceFileHdfsWriter should explicitly track and close the stream it creates.

Thanks!
Ben

Reference material:
http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238
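To make the ownership problem concrete, here is a minimal, self-contained Java sketch. The class names (StreamBackedWriter, ClosableTrackingStream) are hypothetical stand-ins for SequenceFile.Writer (as built via Writer.stream(...)) and FSDataOutputStream; the point is only to show that a writer constructed around a caller-supplied stream flushes on close() but never closes the stream, so the caller must track and close it, which is the proposed fix:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical stand-in for a SequenceFile.Writer built via Writer.stream(...):
// the writer does NOT own the stream it was handed, so close() only flushes.
class StreamBackedWriter implements AutoCloseable {
    private final OutputStream out;

    StreamBackedWriter(OutputStream out) { this.out = out; }

    void append(byte[] record) throws IOException { out.write(record); }

    @Override
    public void close() throws IOException {
        out.flush(); // caller-owned stream: flush only, never close
    }
}

// In-memory stand-in for FSDataOutputStream that records whether
// close() was ever called on it.
class ClosableTrackingStream extends ByteArrayOutputStream {
    boolean closed = false;

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

public class OwnershipDemo {
    public static void main(String[] args) throws IOException {
        ClosableTrackingStream stream = new ClosableTrackingStream();
        try (StreamBackedWriter writer = new StreamBackedWriter(stream)) {
            writer.append("record".getBytes());
        }
        // The writer's close() did not close the underlying stream.
        System.out.println("closed after writer.close(): " + stream.closed);
        // Whoever created the stream must track it and close it explicitly,
        // which is what the suggested fix to SequenceFileHdfsWriter does.
        stream.close();
        System.out.println("closed after owner close: " + stream.closed);
    }
}
```

With the real Hadoop API, using Writer.file(path) instead of Writer.stream(dfs.create(path)) makes the writer open the file itself, so the writer owns the stream and closes it when the writer is closed.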