[Lucene-hadoop Wiki] Update of "SequenceFile" by Arun C Murthy

Apache Wiki Wed, 16 Aug 2006 03:13:16 -0700

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/SequenceFile

The comment on the change is:
First Cut

New page:
== Overview ==

SequenceFile is a flat file consisting of binary key/value pairs. It is 
extensively used in MapReduce as input/output formats.
It is also worth noting that the ''output'' of the Map is always a SequenceFile.

The SequenceFile provides Writer, Reader and Sorter classes for writing, 
reading and sorting respectively.

There are 3 different !SequenceFile formats:
 1. Uncompressed key/value records.
 2. Record compressed key/value records - only 'values' are compressed here.
 3. Block compressed key/value records - both keys and values are collected in 
'blocks' separately and compressed.

The recommended way is to use the SequenceFile.createWriter methods to 
construct the 'preferred' writer implementation.

The [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html SequenceFile.Reader] 
acts as a bridge and can read any of the above SequenceFile formats.

== SequenceFile Formats ==

Essentially there are 3 different file formats for !SequenceFiles depending on 
whether ''compression'' and ''block compression'' are active.
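
As a minimal sketch of how the two flags select one of the three formats: the enum and method names below are hypothetical and are not part of the Hadoop API. The sketch also assumes that ''block compression'' has no effect when ''compression'' is off.

```java
// Sketch only: hypothetical names, not Hadoop API.
enum SeqFileFormatSketch {
    UNCOMPRESSED, RECORD_COMPRESSED, BLOCK_COMPRESSED;

    // Maps the 'compression' and 'block compression' flags to a format.
    static SeqFileFormatSketch fromFlags(boolean compression, boolean blockCompression) {
        if (!compression) {
            // Assumption: blockCompression is ignored when compression is off.
            return UNCOMPRESSED;
        }
        return blockCompression ? BLOCK_COMPRESSED : RECORD_COMPRESSED;
    }
}
```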


However, all of the above formats share a common ''header'' (which is used by 
the !SequenceFile.Reader to return the appropriate key/value pairs). The next 
section summarises the header:
[[Anchor(SeqFileHeader)]]
===== SequenceFile Common Header =====
 * version - A byte array: SEQ<version no.>
 * keyClassName - String
 * valueClassName - String
 * compression - A boolean which specifies if ''compression'' is turned on for 
keys/values in this file.
 * blockCompression -  A boolean which specifies if ''block compression'' is 
turned on for keys/values in this file.
 * sync - A sync marker to denote end of the header.
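
The header fields above can be sketched as a round trip through a byte array. This is an illustration under stated assumptions, not Hadoop's actual encoding: the class name, the 16-byte marker length, the version number and the use of `writeUTF` for the class names are all placeholders chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: mirrors the listed header fields; names and encodings are assumed.
class SeqFileHeaderSketch {
    static final int SYNC_SIZE = 16;   // assumed marker length
    static final byte VERSION_NO = 1;  // placeholder version number

    final String keyClassName;
    final String valueClassName;
    final boolean compression;
    final boolean blockCompression;
    final byte[] sync;

    SeqFileHeaderSketch(String keyClassName, String valueClassName,
                        boolean compression, boolean blockCompression, byte[] sync) {
        if (sync.length != SYNC_SIZE) {
            throw new IllegalArgumentException("sync marker must be " + SYNC_SIZE + " bytes");
        }
        this.keyClassName = keyClassName;
        this.valueClassName = valueClassName;
        this.compression = compression;
        this.blockCompression = blockCompression;
        this.sync = sync.clone();
    }

    // Serialises the fields in the order given in the list above.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.write(new byte[] {'S', 'E', 'Q', VERSION_NO}); // version
            out.writeUTF(keyClassName);                        // assumed string encoding
            out.writeUTF(valueClassName);
            out.writeBoolean(compression);
            out.writeBoolean(blockCompression);
            out.write(sync);                                   // marks end of header
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses a header produced by toBytes(), checking the SEQ prefix first.
    static SeqFileHeaderSketch fromBytes(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte[] version = new byte[4];
            in.readFully(version);
            if (version[0] != 'S' || version[1] != 'E' || version[2] != 'Q') {
                throw new IllegalArgumentException("missing SEQ prefix");
            }
            String key = in.readUTF();
            String value = in.readUTF();
            boolean compression = in.readBoolean();
            boolean blockCompression = in.readBoolean();
            byte[] sync = new byte[SYNC_SIZE];
            in.readFully(sync);
            return new SeqFileHeaderSketch(key, value, compression, blockCompression, sync);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader built along these lines would inspect the two booleans right after the class names to decide how to decode the records that follow.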


The formats for Uncompressed/!RecordCompressed Writers are very similar:
===== Uncompressed/RecordCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record
   * Key
   * (Compressed?) Value
 * A sync-marker every 100 bytes or so to help in seeking to a random point in 
the file and then seeking to the next ''record''.
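
The seek described in the last item can be sketched as a scan for the marker bytes starting at an arbitrary offset. The class and method names are hypothetical, and the sketch assumes that the marker bytes never occur by chance inside record data.

```java
// Sketch only: hypothetical helper, not Hadoop API.
class SyncSeekSketch {

    // Returns the offset just past the first sync marker found at or after
    // 'from', or -1 if no marker follows. Records would resume at that offset.
    static int positionAfterNextSync(byte[] data, byte[] sync, int from) {
        for (int i = Math.max(from, 0); i + sync.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < sync.length; j++) {
                if (data[i + j] != sync[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i + sync.length;
            }
        }
        return -1;
    }
}
```

Because the marker recurs at short intervals, a reader dropped at a random offset only has to scan a small amount of data before it reaches a record boundary.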

The format for the !BlockCompressedWriter is as follows:
===== BlockCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record ''Block''
   * !CompressedKeyLengthsBlockSize
   * !CompressedKeyLengthsBlock
   * !CompressedKeysBlockSize
   * !CompressedKeysBlock
   * !CompressedValueLengthsBlockSize
   * !CompressedValueLengthsBlock
   * !CompressedValuesBlockSize
   * !CompressedValuesBlock
   * A sync-marker to help in seeking to a random point in the file and then 
seeking to the next ''record block''.
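
The block layout above can be sketched as four size-prefixed compressed sections followed by the sync marker. Everything here is an assumption made for illustration: the class name, DEFLATE as the codec, a 4-byte integer for each size field, and the marker written last as in the list above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Sketch only: hypothetical names; not Hadoop's actual on-disk encoding.
class RecordBlockSketch {

    // Compresses one section with DEFLATE (assumed codec).
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(buf)) {
            z.write(raw);
        }
        return buf.toByteArray();
    }

    // Reverses compress().
    static byte[] decompress(byte[] packed) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream z = new InflaterInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[256];
            for (int n = z.read(chunk); n != -1; n = z.read(chunk)) {
                buf.write(chunk, 0, n);
            }
        }
        return buf.toByteArray();
    }

    // Sections are, in order: key lengths, keys, value lengths, values.
    // Each is written as <compressed size><compressed bytes>, then the marker.
    static byte[] writeBlock(byte[][] sections, byte[] sync) {
        if (sections.length != 4) {
            throw new IllegalArgumentException("expected 4 sections");
        }
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (byte[] section : sections) {
                byte[] packed = compress(section);
                out.writeInt(packed.length); // assumed 4-byte size field
                out.write(packed);
            }
            out.write(sync);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the four sections back; the trailing marker is left unread.
    static byte[][] readBlock(byte[] block) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(block));
            byte[][] sections = new byte[4][];
            for (int i = 0; i < 4; i++) {
                byte[] packed = new byte[in.readInt()];
                in.readFully(packed);
                sections[i] = decompress(packed);
            }
            return sections;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping keys and values in separate sections means each section holds data of a single kind, which tends to compress better than interleaved records.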
