In our applications we usually need to know the number of records in a SequenceFile, or in all the SequenceFiles in a directory. We cannot rely on the job output counters, because the files are sometimes uploaded to HDFS or moved between directories.
We wrote a SequenceFileCounterOutputFormat that extends SequenceFileOutputFormat and wraps the returned RecordWriter with a proxy that keeps a count of the records written; when the writer is closed, it writes the count to a '_FILENAME.counter' file. A couple of static methods retrieve the counter for a single file or for all the files in a directory by reading the counter files and, in the latter case, adding them up.

Does anybody else have such requirements?

My concern with this approach is the use of an extra file per file just to keep the counter. One way of addressing this would be to modify SequenceFile so that the counter is written at the very end of the file, after the sync of the last record (a special sync point could be used to distinguish EOF from EOR). A new method in SequenceFile would then seek to the end of the file and read the counter.

Thoughts?

Thxs.

A

PS: If the idea of modifying SequenceFile to support this feature does not fly, we can still contribute our SequenceFileCounterOutputFormat to mapred.lib.
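
To make the per-file counter approach concrete, here is a rough plain-Java sketch of the idea. It is not our actual SequenceFileCounterOutputFormat and deliberately avoids the Hadoop API: the class name CountingWriterSketch, the tab-separated record layout, and the method names countForFile and countForDirectory are all hypothetical, chosen only for illustration. Only the '_FILENAME.counter' naming comes from the description above.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the counting proxy; local files, not HDFS. */
final class CountingWriterSketch implements Closeable {
    private final BufferedWriter delegate; // the "real" writer being wrapped
    private final Path dataFile;
    private long records;                  // number of records written so far

    CountingWriterSketch(Path dataFile) throws IOException {
        this.dataFile = dataFile;
        this.delegate = Files.newBufferedWriter(dataFile, StandardCharsets.UTF_8);
    }

    /** Sidecar path: '_' + FILENAME + '.counter', next to the data file. */
    static Path counterPath(Path dataFile) {
        return dataFile.resolveSibling("_" + dataFile.getFileName() + ".counter");
    }

    /** Forwards the record to the wrapped writer and bumps the count. */
    void write(String key, String value) throws IOException {
        delegate.write(key + "\t" + value); // assumed record layout
        delegate.newLine();
        records++;
    }

    /** Closes the wrapped writer, then writes the count to the sidecar file. */
    @Override
    public void close() throws IOException {
        delegate.close();
        Files.write(counterPath(dataFile),
                Long.toString(records).getBytes(StandardCharsets.UTF_8));
    }

    private static long readCounter(Path sidecar) throws IOException {
        String text = new String(Files.readAllBytes(sidecar), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    /** Record count for one data file, read from its sidecar. */
    static long countForFile(Path dataFile) throws IOException {
        return readCounter(counterPath(dataFile));
    }

    /** Sum of all sidecar counters found directly in the directory. */
    static long countForDirectory(Path dir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> sidecars = Files.newDirectoryStream(dir, "_*.counter")) {
            for (Path sidecar : sidecars) {
                total += readCounter(sidecar);
            }
        }
        return total;
    }
}
```

The sketch also shows the weak spot mentioned above: the count is only as reliable as the sidecar file, so anything that moves or uploads a data file has to carry its '_FILENAME.counter' along with it.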
