The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/SequenceFile

The comment on the change is:
First Cut

New page:
== Overview ==

SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format. It is also worth noting that the ''output'' of the Map is always a SequenceFile.

SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.

There are 3 different !SequenceFile formats:
 1. Uncompressed key/value records.
 2. Record compressed key/value records - only 'values' are compressed here.
 3. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed.

The recommended way is to use the SequenceFile.createWriter methods to construct the 'preferred' writer implementation.

The [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html SequenceFile.Reader] acts as a bridge and can read any of the above SequenceFile formats.

== SequenceFile Formats ==

Essentially there are 3 different file formats for !SequenceFiles, depending on whether ''compression'' and ''block compression'' are active.

However, all of the above formats share a common ''header'' (which is used by the !SequenceFile.Reader to return the appropriate key/value pairs). The next section summarises the header:

[[Anchor(SeqFileHeader)]]
===== SequenceFile Common Header =====
 * version - A byte array: SEQ<version no.>
 * keyClassName - String
 * valueClassName - String
 * compression - A boolean which specifies if ''compression'' is turned on for keys/values in this file.
 * blockCompression - A boolean which specifies if ''block compression'' is turned on for keys/values in this file.
 * sync - A sync marker to denote the end of the header.
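To make the header fields concrete, here is a minimal Python sketch that writes and reads a header with the six fields listed above. It is an illustration only and not Hadoop's actual on-disk encoding: the 2-byte length prefix for strings, the one-byte booleans, the 16-byte sync marker, and all function and class names (`Header`, `write_header`, `read_header`) are assumptions made for this example.

```python
import io
import struct
from dataclasses import dataclass

SYNC_SIZE = 16  # assumed sync marker length for this sketch


@dataclass
class Header:
    version: int
    key_class_name: str
    value_class_name: str
    compression: bool
    block_compression: bool
    sync: bytes

    def format_name(self) -> str:
        """Map the two flags onto the three formats listed in the overview."""
        if not self.compression:
            return "uncompressed"
        return "block-compressed" if self.block_compression else "record-compressed"


def _write_string(out, text):
    # Assumed encoding: 2-byte big-endian length, then UTF-8 bytes.
    raw = text.encode("utf-8")
    out.write(struct.pack(">H", len(raw)))
    out.write(raw)


def _read_string(inp):
    (length,) = struct.unpack(">H", inp.read(2))
    return inp.read(length).decode("utf-8")


def write_header(out, header):
    # version field: the bytes "SEQ" followed by one version byte
    out.write(b"SEQ" + bytes([header.version]))
    _write_string(out, header.key_class_name)
    _write_string(out, header.value_class_name)
    out.write(bytes([int(header.compression), int(header.block_compression)]))
    out.write(header.sync)  # marks the end of the header


def read_header(inp):
    magic = inp.read(4)
    if len(magic) != 4 or magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    version = magic[3]
    key_class = _read_string(inp)
    value_class = _read_string(inp)
    compression, block_compression = (bool(b) for b in inp.read(2))
    sync = inp.read(SYNC_SIZE)
    return Header(version, key_class, value_class,
                  compression, block_compression, sync)
```

A reader built this way can decide how to decode the records that follow purely from the two flags, which is the sense in which the header lets one reader handle all three formats.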
The formats for Uncompressed/!RecordCompressed Writers are very similar:

===== Uncompressed/RecordCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record
   * Key
   * (Compressed?) Value
 * A sync-marker every 100 bytes or so to help in seeking to a random point in the file and then seeking to the next ''record''.

<br>
The format for the !BlockCompressedWriter is as follows:

===== BlockCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record ''Block''
   * !CompressedKeyLengthsBlockSize
   * !CompressedKeyLengthsBlock
   * !CompressedKeysBlockSize
   * !CompressedKeysBlock
   * !CompressedValueLengthsBlockSize
   * !CompressedValueLengthsBlock
   * !CompressedValuesBlockSize
   * !CompressedValuesBlock
 * A sync-marker to help in seeking to a random point in the file and then seeking to the next ''record block''.
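The block-compressed layout can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Hadoop implementation: it uses `zlib` as the codec, 4-byte big-endian integers for every size and length, a fixed hypothetical 16-byte `SYNC` value, and it writes the sync marker after each record block. All function names are invented for the example.

```python
import struct
import zlib

SYNC = bytes(range(16))  # hypothetical 16-byte sync marker


def _pack_section(payload: bytes) -> bytes:
    """Compress one buffer and prefix it with its compressed size."""
    compressed = zlib.compress(payload)
    return struct.pack(">I", len(compressed)) + compressed


def write_block(records):
    """Encode (key, value) byte pairs as one record block plus a sync marker."""
    keys = [k for k, _ in records]
    values = [v for _, v in records]
    key_lengths = b"".join(struct.pack(">I", len(k)) for k in keys)
    value_lengths = b"".join(struct.pack(">I", len(v)) for v in values)
    # Keys and values are collected separately, then each buffer is compressed.
    sections = [key_lengths, b"".join(keys), value_lengths, b"".join(values)]
    return b"".join(_pack_section(s) for s in sections) + SYNC


def _unpack_section(data, pos):
    (size,) = struct.unpack_from(">I", data, pos)
    start = pos + 4
    return zlib.decompress(data[start:start + size]), start + size


def _split(lengths_blob, payload):
    """Cut a concatenated buffer back into items using the recorded lengths."""
    count = len(lengths_blob) // 4
    items, offset = [], 0
    for n in struct.unpack(f">{count}I", lengths_blob):
        items.append(payload[offset:offset + n])
        offset += n
    return items


def read_block(data, pos=0):
    """Decode one record block; return its records and the next block offset."""
    key_lengths, pos = _unpack_section(data, pos)
    keys_blob, pos = _unpack_section(data, pos)
    value_lengths, pos = _unpack_section(data, pos)
    values_blob, pos = _unpack_section(data, pos)
    if data[pos:pos + len(SYNC)] != SYNC:
        raise ValueError("missing sync marker after record block")
    records = list(zip(_split(key_lengths, keys_blob),
                       _split(value_lengths, values_blob)))
    return records, pos + len(SYNC)


def next_block_start(data, pos):
    """From an arbitrary offset, skip forward to the start of the next block.

    Simplification: this ignores the chance that the marker bytes occur
    inside compressed data.
    """
    found = data.find(SYNC, pos)
    return -1 if found < 0 else found + len(SYNC)
```

The last function shows why the sync marker matters: a reader dropped at a random offset cannot interpret the bytes it lands on, but it can scan forward to the next marker and resume decoding at a block boundary.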