Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/SequenceFile

------------------------------------------------------------------------------
  == SequenceFile Formats ==
  
+ This section describes the format for the latest ''''version 4'''' of !SequenceFiles.
+ 
  Essentially there are 3 different file formats for !SequenceFiles depending on whether ''compression'' and ''block compression'' are active. [[BR]]
  
  However all of the above formats share a common ''header'' (which is used by the !SequenceFile.Reader to return the appropriate key/value pairs). The next section summarises the header:
  [[Anchor(SeqFileHeader)]]
  ===== SequenceFile Common Header =====
-  * version - A byte array: SEQ<version no.>
+  * version - A byte array: 3 bytes of magic header ''''SEQ'''', followed by 1 byte of actual version no. (e.g. SEQ4)
   * keyClassName - String
   * valueClassName - String
   * compression - A boolean which specifies if ''compression'' is turned on for keys/values in this file.
   * blockCompression - A boolean which specifies if ''block compression'' is turned on for keys/values in this file.
   * sync - A sync marker to denote end of the header.
  
+ All strings are serialized using [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Text.html#writeString(java.io.DataOutput,%20java.lang.String) Text.writeString] api.
+ 
+ [[BR]]
  [[BR]]
  
  The formats for Uncompressed/!RecordCompressed Writers are very similar:
  
  ===== Uncompressed/RecordCompressed Writer Format =====
   * [#SeqFileHeader Header]
   * Record
+    * Record length
+    * Key length
     * Key
     * (Compressed?) Value
   * A sync-marker every few k bytes or so.
  
@@ -57, +64 @@
   * !CompressedValuesBlockSize
   * !CompressedValuesBlock
  
+ The compressed blocks of ''key lengths'' and ''value lengths'' consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.
+ 
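
The field order listed under "SequenceFile Common Header" can be illustrated with a small round-trip sketch in plain Java. This is not the Hadoop implementation: the class `SeqHeaderSketch`, its `Header` holder, and the methods `writeHeader`/`readHeader` are hypothetical names. Two assumptions are made because the page does not settle them: the sync marker is taken to be 16 bytes long, and `DataOutputStream.writeUTF` stands in for `Text.writeString`, so the bytes produced here are not claimed to match a real !SequenceFile.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class SeqHeaderSketch {
    // Assumption: the page does not state the sync marker length; 16 bytes is used here.
    static final int SYNC_SIZE = 16;

    /** Parsed header fields, named after the bullets in the header list. */
    static final class Header {
        byte version;
        String keyClassName;
        String valueClassName;
        boolean compression;
        boolean blockCompression;
        byte[] sync;
    }

    /** Writes the header fields in the order the page lists them. */
    static byte[] writeHeader(String keyClassName, String valueClassName,
                              boolean compression, boolean blockCompression, byte[] sync) {
        if (sync.length != SYNC_SIZE) {
            throw new IllegalArgumentException("sync marker must be " + SYNC_SIZE + " bytes");
        }
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.write(new byte[] {'S', 'E', 'Q', 4}); // 3 magic bytes, then 1 version byte
            out.writeUTF(keyClassName);               // stand-in for Text.writeString
            out.writeUTF(valueClassName);             // stand-in for Text.writeString
            out.writeBoolean(compression);
            out.writeBoolean(blockCompression);
            out.write(sync);                          // sync marker closes the header
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the fields back in the same order, checking the magic bytes first. */
    static Header readHeader(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            byte[] versionBlock = new byte[4];
            in.readFully(versionBlock);
            if (versionBlock[0] != 'S' || versionBlock[1] != 'E' || versionBlock[2] != 'Q') {
                throw new IllegalArgumentException("missing SEQ magic header");
            }
            Header h = new Header();
            h.version = versionBlock[3];
            h.keyClassName = in.readUTF();
            h.valueClassName = in.readUTF();
            h.compression = in.readBoolean();
            h.blockCompression = in.readBoolean();
            h.sync = new byte[SYNC_SIZE];
            in.readFully(h.sync);
            return h;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        Arrays.fill(sync, (byte) 0x5A); // arbitrary marker bytes for the demo
        byte[] bytes = writeHeader("org.apache.hadoop.io.Text",
                "org.apache.hadoop.io.Text", true, false, sync);
        Header h = readHeader(bytes);
        System.out.println("version=" + h.version + " key=" + h.keyClassName
                + " compression=" + h.compression
                + " blockCompression=" + h.blockCompression);
    }
}
```

A reader built along these lines would inspect the two booleans right after parsing the header to decide which of the three record layouts follows.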