[Lucene-hadoop Wiki] Update of "SequenceFile" by Arun C Murthy

Apache Wiki Wed, 16 Aug 2006 22:14:15 -0700


The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/SequenceFile

------------------------------------------------------------------------------
  
  == SequenceFile Formats ==
  
+ This section describes the format of the latest version, ''''version 4'''', of 
!SequenceFiles.
+ 
  Essentially there are 3 different file formats for !SequenceFiles depending 
on whether ''compression'' and ''block compression'' are active.
  
  [[BR]]
  However, all of the above formats share a common ''header'' (which is used by 
the !SequenceFile.Reader to return the appropriate key/value pairs). The next 
section summarises the header:
  [[Anchor(SeqFileHeader)]]
  ===== SequenceFile Common Header =====
-  * version - A byte array: SEQ<version no.>
+  * version - A byte array: 3 bytes of magic header ''''SEQ'''', followed by 1 
byte of actual version no. (e.g. SEQ4)
   * keyClassName - String
   * valueClassName - String
   * compression - A boolean which specifies if ''compression'' is turned on 
for keys/values in this file.
   * blockCompression -  A boolean which specifies if ''block compression'' is 
turned on for keys/values in this file.
   * sync - A sync marker to denote end of the header.
  
+ All strings are serialized using the 
[http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Text.html#writeString(java.io.DataOutput,%20java.lang.String)
 Text.writeString] API.
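
As a rough illustration of the header layout listed above, the following Python sketch reads the fields in the stated order. Several details are assumptions that the page does not spell out: the version byte is taken as a raw byte value (4) rather than the ASCII digit, each string is taken to be a length prefix followed by UTF-8 bytes, each boolean is taken to be a single byte, and the sync marker is taken to be 16 bytes long. Only class names shorter than 128 bytes are handled, so the length prefix fits in one byte. The names `read_string`, `read_header` and `SYNC_SIZE` are hypothetical and are not part of any Hadoop API.

```python
import io

MAGIC = b"SEQ"
SYNC_SIZE = 16  # assumption: sync marker length is not stated on the page


def read_string(stream):
    """Read one length-prefixed UTF-8 string (assumed string layout)."""
    first = stream.read(1)
    if len(first) != 1:
        raise ValueError("truncated string length")
    length = first[0]
    if length > 127:
        # Longer strings need a multi-byte length prefix; this sketch skips that.
        raise ValueError("multi-byte length prefixes are not handled here")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated string body")
    return data.decode("utf-8")


def read_boolean(stream):
    """Read one boolean, assumed to be stored as a single byte."""
    raw = stream.read(1)
    if len(raw) != 1:
        raise ValueError("truncated boolean")
    return raw[0] != 0


def read_header(stream):
    """Parse the common header fields in the order the page lists them."""
    magic = stream.read(3)
    if magic != MAGIC:
        raise ValueError("bad magic: %r" % magic)
    version = stream.read(1)
    if len(version) != 1:
        raise ValueError("truncated version byte")
    header = {"version": version[0]}  # assumption: raw byte, not an ASCII digit
    header["keyClassName"] = read_string(stream)
    header["valueClassName"] = read_string(stream)
    header["compression"] = read_boolean(stream)
    header["blockCompression"] = read_boolean(stream)
    sync = stream.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise ValueError("truncated sync marker")
    header["sync"] = sync
    return header
```

A caller would open the file in binary mode and pass the file object to `read_header`; the returned sync marker can then be kept for recognising the markers that appear later in the file.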
+ 
+ [[BR]]
  [[BR]]
  The formats for Uncompressed/!RecordCompressed Writers are very similar:
  ===== Uncompressed/RecordCompressed Writer Format =====
   * [#SeqFileHeader Header]
   * Record
+    * Record length
+    * Key length
     * Key
     * (Compressed?) Value
   * A sync-marker every few kilobytes or so. 
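
The record layout above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the actual reader: both lengths are assumed to be 4-byte big-endian signed integers, and the record length is assumed to count the key bytes plus the value bytes. Sync markers are not handled, so a negative record length is simply rejected, and the value is returned as raw bytes without decompression. The name `read_record` is hypothetical.

```python
import io
import struct


def read_record(stream):
    """Read one record: record length, key length, key bytes, value bytes."""
    prefix = stream.read(8)
    if len(prefix) != 8:
        raise ValueError("truncated record prefix")
    # Assumption: two 4-byte big-endian signed integers.
    record_length, key_length = struct.unpack(">ii", prefix)
    if record_length < 0:
        # Sync markers are outside the scope of this sketch.
        raise ValueError("negative record length")
    if key_length < 0 or key_length > record_length:
        raise ValueError("key length out of range")
    # Assumption: record length = key bytes + value bytes.
    value_length = record_length - key_length
    key = stream.read(key_length)
    value = stream.read(value_length)
    if len(key) != key_length or len(value) != value_length:
        raise ValueError("truncated record body")
    return key, value
```

For a record-compressed file the returned value bytes would still need to be decompressed before use.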
@@ -57, +64 @@

     * !CompressedValuesBlockSize
     * !CompressedValuesBlock
  
+  The compressed blocks of ''key lengths'' and ''value lengths'' consist of 
the actual lengths of individual keys/values encoded in ZeroCompressedInteger 
format.
+
