[1/2] orc git commit: ORC-339. Reorganize the ORC file format specification.

omalley Tue, 17 Apr 2018 10:46:58 -0700

Repository: orc
Updated Branches:
  refs/heads/master 2a5ad9de7 -> 240931203



http://git-wip-us.apache.org/repos/asf/orc/blob/24093120/site/specification/ORCv1.md
----------------------------------------------------------------------
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
new file mode 100644
index 0000000..23e9935
--- /dev/null
+++ b/site/specification/ORCv1.md
@@ -0,0 +1,1012 @@
+---
+layout: page
+title: ORC Specification v1
+---
+
+This version of the file format was originally released as part of
+Hive 0.12.
+
+# Motivation
+
+Hive's RCFile was the standard format for storing tabular data in
+Hadoop for several years. However, RCFile has limitations because it
+treats each column as a binary blob without semantics. In Hive 0.11 we
+added a new file format named Optimized Row Columnar (ORC) file that
+uses and retains the type information from the table definition. ORC
+uses type specific readers and writers that provide light weight
+compression techniques such as dictionary encoding, bit packing, delta
+encoding, and run length encoding -- resulting in dramatically smaller
+files. Additionally, ORC can apply generic compression using zlib, or
+Snappy on top of the lightweight compression for even smaller
+files. However, storage savings are only part of the gain. ORC
+supports projection, which selects subsets of the columns for reading,
+so that queries reading only one column read only the required
+bytes. Furthermore, ORC files include light weight indexes that
+include the minimum and maximum values for each column in each set of
+10,000 rows and the entire file. Using pushdown filters from Hive, the
+file reader can skip entire sets of rows that aren't important for
+this query.
+
+![ORC file structure]({{ site.url }}/img/OrcFileLayout.png)
+
+# File Tail
+
+Since HDFS does not support changing the data in a file after it is
+written, ORC stores the top level index at the end of the file. The
+overall structure of the file is given in the figure above.  The
+file's tail consists of 3 parts; the file metadata, file footer and
+postscript.
+
+The metadata for ORC is stored using
+[Protocol Buffers](https://s.apache.org/protobuf_encoding), which provides
+the ability to add new fields without breaking readers. This document
+incorporates the Protobuf definition from the
+[ORC source code](https://s.apache.org/orc_proto) and the
+reader is encouraged to review the Protobuf encoding if they need to
+understand the byte-level encoding
+
+## Postscript
+
+The Postscript section provides the necessary information to interpret
+the rest of the file including the length of the file's Footer and
+Metadata sections, the version of the file, and the kind of general
+compression used (eg. none, zlib, or snappy). The Postscript is never
+compressed and ends one byte before the end of the file. The version
+stored in the Postscript is the lowest version of Hive that is
+guaranteed to be able to read the file and it stored as a sequence of
+the major and minor version. This file version is encoded as [0,12].
+
+The process of reading an ORC file works backwards through the
+file. Rather than making multiple short reads, the ORC reader reads
+the last 16k bytes of the file with the hope that it will contain both
+the Footer and Postscript sections. The final byte of the file
+contains the serialized length of the Postscript, which must be less
+than 256 bytes. Once the Postscript is parsed, the compressed
+serialized length of the Footer is known and it can be decompressed
+and parsed.
+
+```message PostScript {
+ // the length of the footer section in bytes
+ optional uint64 footerLength = 1;
+ // the kind of generic compression used
+ optional CompressionKind compression = 2;
+ // the maximum size of each compression chunk
+ optional uint64 compressionBlockSize = 3;
+ // the version of the writer
+ repeated uint32 version = 4 [packed = true];
+ // the length of the metadata section in bytes
+ optional uint64 metadataLength = 5;
+ // the fixed string "ORC"
+ optional string magic = 8000;
+}
+```
+
+```enum CompressionKind {
+ NONE = 0;
+ ZLIB = 1;
+ SNAPPY = 2;
+ LZO = 3;
+ LZ4 = 4;
+ ZSTD = 5;
+}
+```
+
+## Footer
+
+The Footer section contains the layout of the body of the file, the
+type schema information, the number of rows, and the statistics about
+each of the columns.
+
+The file is broken in to three parts- Header, Body, and Tail. The
+Header consists of the bytes "ORC'' to support tools that want to
+scan the front of the file to determine the type of the file. The Body
+contains the rows and indexes, and the Tail gives the file level
+information as described in this section.
+
+```message Footer {
+ // the length of the file header in bytes (always 3)
+ optional uint64 headerLength = 1;
+ // the length of the file header and body in bytes
+ optional uint64 contentLength = 2;
+ // the information about the stripes
+ repeated StripeInformation stripes = 3;
+ // the schema information
+ repeated Type types = 4;
+ // the user metadata that was added
+ repeated UserMetadataItem metadata = 5;
+ // the total number of rows in the file
+ optional uint64 numberOfRows = 6;
+ // the statistics of each column across the file
+ repeated ColumnStatistics statistics = 7;
+ // the maximum number of rows in each index entry
+ optional uint32 rowIndexStride = 8;
+}
+```
+
+### Stripe Information
+
+The body of the file is divided into stripes. Each stripe is self
+contained and may be read using only its own bytes combined with the
+file's Footer and Postscript. Each stripe contains only entire rows so
+that rows never straddle stripe boundaries. Stripes have three
+sections: a set of indexes for the rows within the stripe, the data
+itself, and a stripe footer. Both the indexes and the data sections
+are divided by columns so that only the data for the required columns
+needs to be read.
+
+```message StripeInformation {
+ // the start of the stripe within the file
+ optional uint64 offset = 1;
+ // the length of the indexes in bytes
+ optional uint64 indexLength = 2;
+ // the length of the data in bytes
+ optional uint64 dataLength = 3;
+ // the length of the footer in bytes
+ optional uint64 footerLength = 4;
+ // the number of rows in the stripe
+ optional uint64 numberOfRows = 5;
+}
+```
+
+### Type Information
+
+All of the rows in an ORC file must have the same schema. Logically
+the schema is expressed as a tree as in the figure below, where
+the compound types have subcolumns under them.
+
+![ORC column structure]({{ site.url }}/img/TreeWriters.png)
+
+The equivalent Hive DDL would be:
+
+```create table Foobar (
+ myInt int,
+ myMap map<string,
+ struct<myString : string,
+ myDouble: double>>,
+ myTime timestamp
+);
+```
+
+The type tree is flattened in to a list via a pre-order traversal
+where each type is assigned the next id. Clearly the root of the type
+tree is always type id 0. Compound types have a field named subtypes
+that contains the list of their children's type ids.
+
+```message Type {
+ enum Kind {
+ BOOLEAN = 0;
+ BYTE = 1;
+ SHORT = 2;
+ INT = 3;
+ LONG = 4;
+ FLOAT = 5;
+ DOUBLE = 6;
+ STRING = 7;
+ BINARY = 8;
+ TIMESTAMP = 9;
+ LIST = 10;
+ MAP = 11;
+ STRUCT = 12;
+ UNION = 13;
+ DECIMAL = 14;
+ DATE = 15;
+ VARCHAR = 16;
+ CHAR = 17;
+ }
+ // the kind of this type
+ required Kind kind = 1;
+ // the type ids of any subcolumns for list, map, struct, or union
+ repeated uint32 subtypes = 2 [packed=true];
+ // the list of field names for struct
+ repeated string fieldNames = 3;
+ // the maximum length of the type for varchar or char in UTF-8 characters
+ optional uint32 maximumLength = 4;
+ // the precision and scale for decimal
+ optional uint32 precision = 5;
+ optional uint32 scale = 6;
+}
+```
+
+### Column Statistics
+
+The goal of the column statistics is that for each column, the writer
+records the count and depending on the type other useful fields. For
+most of the primitive types, it records the minimum and maximum
+values; and for numeric types it additionally stores the sum.
+From Hive 1.1.0 onwards, the column statistics will also record if
+there are any null values within the row group by setting the hasNull flag.
+The hasNull flag is used by ORC's predicate pushdown to better answer
+'IS NULL' queries.
+
+```message ColumnStatistics {
+ // the number of values
+ optional uint64 numberOfValues = 1;
+ // At most one of these has a value for any column
+ optional IntegerStatistics intStatistics = 2;
+ optional DoubleStatistics doubleStatistics = 3;
+ optional StringStatistics stringStatistics = 4;
+ optional BucketStatistics bucketStatistics = 5;
+ optional DecimalStatistics decimalStatistics = 6;
+ optional DateStatistics dateStatistics = 7;
+ optional BinaryStatistics binaryStatistics = 8;
+ optional TimestampStatistics timestampStatistics = 9;
+ optional bool hasNull = 10;
+}
+```
+
+For integer types (tinyint, smallint, int, bigint), the column
+statistics includes the minimum, maximum, and sum. If the sum
+overflows long at any point during the calculation, no sum is
+recorded.
+
+```message IntegerStatistics {
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+ optional sint64 sum = 3;
+}
+```
+
+For floating point types (float, double), the column statistics
+include the minimum, maximum, and sum. If the sum overflows a double,
+no sum is recorded.
+
+```message DoubleStatistics {
+ optional double minimum = 1;
+ optional double maximum = 2;
+ optional double sum = 3;
+}
+```
+
+For strings, the minimum value, maximum value, and the sum of the
+lengths of the values are recorded.
+
+```message StringStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ // sum will store the total length of all strings
+ optional sint64 sum = 3;
+}
+```
+
+For booleans, the statistics include the count of false and true values.
+
+```message BucketStatistics {
+ repeated uint64 count = 1 [packed=true];
+}
+```
+
+For decimals, the minimum, maximum, and sum are stored.
+
+```message DecimalStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ optional string sum = 3;
+}
+```
+
+Date columns record the minimum and maximum values as the number of
+days since the epoch (1/1/2015).
+
+```message DateStatistics {
+ // min,max values saved as days since epoch
+ optional sint32 minimum = 1;
+ optional sint32 maximum = 2;
+}
+```
+
+Timestamp columns record the minimum and maximum values as the number of
+milliseconds since the epoch (1/1/2015).
+
+```message TimestampStatistics {
+ // min,max values saved as milliseconds since epoch
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+}
+```
+
+Binary columns store the aggregate number of bytes across all of the values.
+
+```message BinaryStatistics {
+ // sum will store the total binary blob length
+ optional sint64 sum = 1;
+}
+```
+
+### User Metadata
+
+The user can add arbitrary key/value pairs to an ORC file as it is
+written. The contents of the keys and values are completely
+application defined, but the key is a string and the value is
+binary. Care should be taken by applications to make sure that their
+keys are unique and in general should be prefixed with an organization
+code.
+
+```message UserMetadataItem {
+ // the user defined key
+ required string name = 1;
+ // the user defined binary value
+ required bytes value = 2;
+}
+```
+
+### File Metadata
+
+The file Metadata section contains column statistics at the stripe
+level granularity. These statistics enable input split elimination
+based on the predicate push-down evaluated per a stripe.
+
+```message StripeStatistics {
+ repeated ColumnStatistics colStats = 1;
+}
+```
+
+```message Metadata {
+ repeated StripeStatistics stripeStats = 1;
+}
+```
+
+# Compression
+
+If the ORC file writer selects a generic compression codec (zlib or
+snappy), every part of the ORC file except for the Postscript is
+compressed with that codec. However, one of the requirements for ORC
+is that the reader be able to skip over compressed bytes without
+decompressing the entire stream. To manage this, ORC writes compressed
+streams in chunks with headers as in the figure below.
+To handle uncompressable data, if the compressed data is larger than
+the original, the original is stored and the isOriginal flag is
+set. Each header is 3 bytes long with (compressedLength * 2 +
+isOriginal) stored as a little endian value. For example, the header
+for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
+0x03]. The header for 5 bytes that did not compress would be [0x0b,
+0x00, 0x00]. Each compression chunk is compressed independently so
+that as long as a decompressor starts at the top of a header, it can
+start decompressing without the previous bytes.
+
+![compression streams]({{ site.url }}/img/CompressionStream.png)
+
+The default compression chunk size is 256K, but writers can choose
+their own value. Larger chunks lead to better compression, but require
+more memory. The chunk size is recorded in the Postscript so that
+readers can allocate appropriately sized buffers. Readers are
+guaranteed that no chunk will expand to more than the compression chunk
+size.
+
+ORC files without generic compression write each stream directly
+with no headers.
+
+# Run Length Encoding
+
+## Base 128 Varint
+
+Variable width integer encodings take advantage of the fact that most
+numbers are small and that having smaller encodings for small numbers
+shrinks the overall size of the data. ORC uses the varint format from
+Protocol Buffers, which writes data in little endian format using the
+low 7 bits of each byte. The high bit in each byte is set if the
+number continues into the next byte.
+
+Unsigned Original | Serialized
+:---------------- | :---------
+0                 | 0x00
+1                 | 0x01
+127               | 0x7f
+128               | 0x80, 0x01
+129               | 0x81, 0x01
+16,383            | 0xff, 0x7f
+16,384            | 0x80, 0x80, 0x01
+16,385            | 0x81, 0x80, 0x01
+
+For signed integer types, the number is converted into an unsigned
+number using a zigzag encoding. Zigzag encoding moves the sign bit to
+the least significant bit using the expression (val << 1) ^ (val >>
+63) and derives its name from the fact that positive and negative
+numbers alternate once encoded. The unsigned number is then serialized
+as above.
+
+Signed Original | Unsigned
+:-------------- | :-------
+0               | 0
+-1              | 1
+1               | 2
+-2              | 3
+2               | 4
+
+## Byte Run Length Encoding
+
+For byte streams, ORC uses a very light weight encoding of identical
+values.
+
+* Run - a sequence of at least 3 identical values
+* Literals - a sequence of non-identical values
+
+The first byte of each group of values is a header than determines
+whether it is a run (value between 0 to 127) or literal list (value
+between -128 to -1). For runs, the control byte is the length of the
+run minus the length of the minimal run (3) and the control byte for
+literal lists is the negative length of the list. For example, a
+hundred 0's is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
+would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
+either of the encodings.
+
+## Boolean Run Length Encoding
+
+For encoding boolean types, the bits are put in the bytes from most
+significant to least significant. The bytes are encoded using byte run
+length encoding as described in the previous section. For example,
+the byte sequence [0xff, 0x80] would be one true followed by
+seven false values.
+
+## Integer Run Length Encoding, version 1
+
+In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
+which provides a lightweight compression of signed or unsigned integer
+sequences. RLEv1 has two sub-encodings:
+
+* Run - a sequence of values that differ by a small fixed delta
+* Literals - a sequence of varint encoded values
+
+Runs start with an initial byte of 0x00 to 0x7f, which encodes the
+length of the run - 3. A second byte provides the fixed delta in the
+range of -128 to 127. Finally, the first value of the run is encoded
+as a base 128 varint.
+
+For example, if the sequence is 100 instances of 7 the encoding would
+start with 100 - 3, followed by a delta of 0, and a varint of 7 for
+an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
+running from 100 to 1, the first byte is 100 - 3, the delta is -1,
+and the varint is 100 for an encoding of [0x61, 0xff, 0x64].
+
+Literals start with an initial byte of 0x80 to 0xff, which corresponds
+to the negative of number of literals in the sequence. Following the
+header byte, the list of N varints is encoded. Thus, if there are
+no runs, the overhead is 1 byte for each 128 integers. The first 5
+prime numbers [2, 3, 4, 7, 11] would encoded as [0xfb, 0x02, 0x03,
+0x04, 0x07, 0xb].
+
+## Integer Run Length Encoding, version 2
+
+In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
+which has improved compression and fixed bit width encodings for
+faster expansion. RLEv2 uses four sub-encodings based on the data:
+
+* Short Repeat - used for short sequences with repeated values
+* Direct - used for random sequences with a fixed bit width
+* Patched Base - used for random sequences with a variable bit width
+* Delta - used for monotonically increasing or decreasing sequences
+
+### Short Repeat
+
+The short repeat encoding is used for short repeating integer
+sequences with the goal of minimizing the overhead of the header. All
+of the bits listed in the header are from the first byte to the last
+and from most significant bit to least significant bit. If the type is
+signed, the value is zigzag encoded.
+
+* 1 byte header
+  * 2 bits for encoding type (0)
+  * 3 bits for width (W) of repeating value (1 to 8 bytes)
+  * 3 bits for repeat count (3 to 10 values)
+* W bytes in big endian format, which is zigzag encoded if they type
+  is signed
+
+The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
+serialized with short repeat encoding (0), a width of 2 bytes (1), and
+repeat count of 5 (2) as [0x0a, 0x27, 0x10].
+
+### Direct
+
+The direct encoding is used for integer sequences whose values have a
+relatively constant bit width. It encodes the values directly using a
+fixed width big endian encoding. The width of the values is encoded
+using the table below.
+ 
+The 5 bit width encoding table for RLEv2:
+
+Width in Bits | Encoded Value | Notes
+:------------ | :------------ | :----
+0             | 0             | for delta encoding
+1             | 0             | for non-delta encoding
+2             | 1
+4             | 3
+8             | 7
+16            | 15
+24            | 23
+32            | 27
+40            | 28
+48            | 29
+56            | 30
+64            | 31
+3             | 2             | deprecated
+5 <= x <= 7   | x - 1         | deprecated
+9 <= x <= 15  | x - 1         | deprecated
+17 <= x <= 21 | x - 1         | deprecated
+26            | 24            | deprecated
+28            | 25            | deprecated
+30            | 26            | deprecated
+
+* 2 bytes header
+  * 2 bits for encoding type (1)
+  * 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+    width encoding table
+  * 9 bits for length (L) (1 to 512 values)
+* W * L bits (padded to the next byte) encoded in big endian format, which is
+  zigzag encoding if the type is signed
+
+The unsigned sequence of [23713, 43806, 57005, 48879] would be
+serialized with direct encoding (1), a width of 16 bits (15), and
+length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
+0xbe, 0xef].
+
+### Patched Base
+
+The patched base encoding is used for integer sequences whose bit
+widths varies a lot. The minimum signed value of the sequence is found
+and subtracted from the other values. The bit width of those adjusted
+values is analyzed and the 90 percentile of the bit width is chosen
+as W. The 10\% of values larger than W use patches from a patch list
+to set the additional bits. Patches are encoded as a list of gaps in
+the index values and the additional value bits.
+
+* 4 bytes header
+  * 2 bits for encoding type (2)
+  * 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+      width encoding table
+  * 9 bits for length (L) (1 to 512 values)
+  * 3 bits for base value width (BW) (1 to 8 bytes)
+  * 5 bits for patch width (PW) (1 to 64 bits) using  the 5 bit width
+    encoding table
+  * 3 bits for patch gap width (PGW) (1 to 8 bits)
+  * 5 bits for patch list length (PLL) (0 to 31 patches)
+* Base value (BW bytes) - The base value is stored as a big endian value
+  with negative values marked by the most significant bit set. If it that
+  bit is set, the entire value is negated.
+* Data values (W * L bits padded to the byte) - A sequence of W bit positive
+  values that are added to the base value.
+* Data values (W * L bits padded to the byte) - A sequence of W bit positive
+  values that are added to the base value.
+* Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
+  that didn't fit within W bits. Each entry in the list consists of a
+  gap, which is the number of elements skipped from the previous
+  patch, and a patch value. Patches are applied by logically or'ing
+  the data values with the relevant patch shifted W bits left. If a
+  patch is 0, it was introduced to skip over more than 255 items. The
+  combined length of each patch (PGW + PW) must be less or equal to
+  64.
+
+The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
+2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
+has a minimum of 2000, which makes the adjusted
+sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
+150, 160, 170, 180, 190]. It has an
+encoding of patched base (2), a bit width of 8 (7), a length of 20
+(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
+patch gap width of 2 bits (1), and a patch list length of 1 (1). The
+base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
+0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
+0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8]
+
+### Delta
+
+The Delta encoding is used for monotonically increasing or decreasing
+sequences. The first two numbers in the sequence can not be identical,
+because the encoding is using the sign of the first delta to determine
+if the series is increasing or decreasing.
+
+* 2 bytes header
+  * 2 bits for encoding type (3)
+  * 5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
+    width encoding table
+  * 9 bits for run length (L) (1 to 512 values)
+* Base value - encoded as (signed or unsigned) varint
+* Delta base - encoded as signed varint
+* Delta values $W * (L - 2)$ bytes - encode each delta after the first
+  one. If the delta base is positive, the sequence is increasing and if it is
+  negative the sequence is decreasing.
+
+The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
+serialized with delta encoding (3), a width of 4 bits (3), length of
+10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
+sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].
+
+# Stripes
+
+The body of ORC files consists of a series of stripes. Stripes are
+large (typically ~200MB) and independent of each other and are often
+processed by different tasks. The defining characteristic for columnar
+storage formats is that the data for each column is stored separately
+and that reading data out of the file should be proportional to the
+number of columns read.
+
+In ORC files, each column is stored in several streams that are stored
+next to each other in the file. For example, an integer column is
+represented as two streams PRESENT, which uses one with a bit per
+value recording if the value is non-null, and DATA, which records the
+non-null values. If all of a column's values in a stripe are non-null,
+the PRESENT stream is omitted from the stripe. For binary data, ORC
+uses three streams PRESENT, DATA, and LENGTH, which stores the length
+of each value. The details of each type will be presented in the
+following subsections.
+
+## Stripe Footer
+
+The stripe footer contains the encoding of each column and the
+directory of the streams including their location.
+
+```message StripeFooter {
+ // the location of each stream
+ repeated Stream streams = 1;
+ // the encoding of each column
+ repeated ColumnEncoding columns = 2;
+}
+```
+
+To describe each stream, ORC stores the kind of stream, the column id,
+and the stream's size in bytes. The details of what is stored in each stream
+depends on the type and encoding of the column.
+
+```message Stream {
+ enum Kind {
+ // boolean stream of whether the next value is non-null
+ PRESENT = 0;
+ // the primary data stream
+ DATA = 1;
+ // the length of each value for variable length data
+ LENGTH = 2;
+ // the dictionary blob
+ DICTIONARY\_DATA = 3;
+ // deprecated prior to Hive 0.11
+ // It was used to store the number of instances of each value in the
+ // dictionary
+ DICTIONARY_COUNT = 4;
+ // a secondary data stream
+ SECONDARY = 5;
+ // the index for seeking to particular row groups
+ ROW_INDEX = 6;
+ // original bloom filters used before ORC-101
+ BLOOM_FILTER = 7;
+ // bloom filters that consistently use utf8
+ BLOOM_FILTER_UTF8 = 8;
+ }
+ required Kind kind = 1;
+ // the column id
+ optional uint32 column = 2;
+ // the number of bytes in the file
+ optional uint64 length = 3;
+}
+```
+
+Depending on their type several options for encoding are possible. The
+encodings are divided into direct or dictionary-based categories and
+further refined as to whether they use RLE v1 or v2.
+
+```message ColumnEncoding {
+ enum Kind {
+ // the encoding is mapped directly to the stream using RLE v1
+ DIRECT = 0;
+ // the encoding uses a dictionary of unique values using RLE v1
+ DICTIONARY = 1;
+ // the encoding is direct using RLE v2
+ DIRECT\_V2 = 2;
+ // the encoding is dictionary-based using RLE v2
+ DICTIONARY\_V2 = 3;
+ }
+ required Kind kind = 1;
+ // for dictionary encodings, record the size of the dictionary
+ optional uint32 dictionarySize = 2;
+}
+```
+
+# Column Encodings
+
+## SmallInt, Int, and BigInt Columns
+
+All of the 16, 32, and 64 bit integer column types use the same set of
+potential encodings, which is basically whether they use RLE v1 or
+v2. If the PRESENT stream is not included, all of the values are
+present. For values that have false bits in the present stream, no
+values are included in the data stream.
+
+Encoding  | Stream Kind | Optional | Contents
+:-------- | :---------- | :------- | :-------
+DIRECT    | PRESENT     | Yes      | Boolean RLE
+          | DATA        | No       | Signed Integer RLE v1
+DIRECT_V2 | PRESENT     | Yes      | Boolean RLE
+          | DATA        | No       | Signed Integer RLE v2
+
+## Float and Double Columns
+
+Floating point types are stored using IEEE 754 floating point bit
+layout. Float columns use 4 bytes per value and double columns use 8
+bytes.
+
+Encoding  | Stream Kind | Optional | Contents
+:-------- | :---------- | :------- | :-------
+DIRECT    | PRESENT     | Yes      | Boolean RLE
+          | DATA        | No       | IEEE 754 floating point representation
+
+## String, Char, and VarChar Columns
+
+String, char, and varchar columns may be encoded either using a
+dictionary encoding or a direct encoding. A direct encoding should be
+preferred when there are many distinct values. In all of the
+encodings, the PRESENT stream encodes whether the value is null. The
+Java ORC writer automatically picks the encoding after the first row
+group (10,000 rows).
+
+For direct encoding the UTF-8 bytes are saved in the DATA stream and
+the length of each value is written into the LENGTH stream. In direct
+encoding, if the values were ["Nevada", "California"]; the DATA
+would be "NevadaCalifornia" and the LENGTH would be [6, 10].
+
+For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+each unique value are placed into DICTIONARY_DATA. The length of each
+item in the dictionary is put into the LENGTH stream. The DATA stream
+consists of the sequence of references to the dictionary elements.
+
+In dictionary encoding, if the values were ["Nevada",
+"California", "Nevada", "California", and "Florida"]; the
+DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would
+be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DICTIONARY    | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unsigned Integer RLE v1
+              | DICTIONARY_DATA | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v2
+DICTIONARY_V2 | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unsigned Integer RLE v2
+              | DICTIONARY_DATA | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Boolean Columns
+
+Boolean columns are rare, but have a simple encoding.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Boolean RLE
+
+## TinyInt Columns
+
+TinyInt (byte) columns use byte run length encoding.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Byte RLE
+
+## Binary Columns
+
+Binary data is encoded with a PRESENT stream, a DATA stream that records
+the contents, and a LENGTH stream that records the number of bytes per a
+value.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Decimal Columns
+
+Decimal was introduced in Hive 0.11 with infinite precision (the total
+number of digits). In Hive 0.13, the definition was change to limit
+the precision to a maximum of 38 digits, which conveniently uses 127
+bits plus a sign bit. The current encoding of decimal columns stores
+the integer representation of the value as an unbounded length zigzag
+encoded base 128 varint. The scale is stored in the SECONDARY stream
+as an signed integer.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unbounded base 128 varints
+              | SECONDARY       | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unbounded base 128 varints
+              | SECONDARY       | No       | Unsigned Integer RLE v2
+
+## Date Columns
+
+Date data is encoded with a PRESENT stream, a DATA stream that records
+the number of days after January 1, 1970 in UTC.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v2
+
+## Timestamp Columns
+
+Timestamp records times down to nanoseconds as a PRESENT stream that
+records non-null values, a DATA stream that records the number of
+seconds after 1 January 2015, and a SECONDARY stream that records the
+number of nanoseconds.
+
+Because the number of nanoseconds often has a large number of trailing
+zeros, the number has trailing decimal zero digits removed and the
+last three bits are used to record how many zeros were removed. Thus
+1000 nanoseconds would be serialized as 0x0b and 100000 would be
+serialized as 0x0d.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v1
+              | SECONDARY       | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v2
+              | SECONDARY       | No       | Unsigned Integer RLE v2
+
+## Struct Columns
+
+Structs have no data themselves and delegate everything to their child
+columns except for their PRESENT stream. They have a child column
+for each of the fields.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+
+## List Columns
+
+Lists are encoded as the PRESENT stream and a length stream with
+number of items in each list. They have a single child column for the
+element values.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Map Columns
+
+Maps are encoded as the PRESENT stream and a length stream with number
+of items in each list. They have a child column for the key and
+another child column for the value.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Union Columns
+
+Unions are encoded as the PRESENT stream and a tag stream that controls which
+potential variant is used. They have a child column for each variant of the
+union. Currently ORC union types are limited to 256 variants, which matches
+the Hive type model.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DIRECT          | No       | Byte RLE
+
+# Indexes
+
+## Row Group Index
+
+The row group indexes consist of a ROW_INDEX stream for each primitive
+column that has an entry for each row group. Row groups are controlled
+by the writer and default to 10,000 rows. Each RowIndexEntry gives the
+position of each stream for the column and the statistics for that row
+group.
+
+The index streams are placed at the front of the stripe, because in
+the default case of streaming they do not need to be read. They are
+only loaded when either predicate push down is being used or the
+reader seeks to a particular row.
+
+```message RowIndexEntry {
+ repeated uint64 positions = 1 [packed=true];
+ optional ColumnStatistics statistics = 2;
+}
+```
+
+```message RowIndex {
+ repeated RowIndexEntry entry = 1;
+}
+```
+
+To record positions, each stream needs a sequence of numbers. For
+uncompressed streams, the position is the byte offset of the RLE run's
+start location followed by the number of values that need to be
+consumed from the run. In compressed streams, the first number is the
+start of the compression chunk in the stream, followed by the number
+of decompressed bytes that need to be consumed, and finally the number
+of values consumed in the RLE.
+
+For columns with multiple streams, the sequences of positions in each
+stream are concatenated. That was an unfortunate decision on my part
+that we should fix at some point, because it makes code that uses the
+indexes error-prone.
+
+Because dictionaries are accessed randomly, there is not a position to
+record for the dictionary and the entire dictionary must be read even
+if only part of a stripe is being read.
+
+## Bloom Filter Index
+
+Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
+Predicate pushdown can make use of bloom filters to better prune
+the row groups that do not satisfy the filter condition.
+The bloom filter indexes consist of a BLOOM_FILTER stream for each
+column specified through 'orc.bloom.filter.columns' table properties.
+A BLOOM_FILTER stream records a bloom filter entry for each row
+group (default to 10,000 rows) in a column. Only the row groups that
+satisfy min/max row index evaluation will be evaluated against the
+bloom filter index.
+
+Each BloomFilterEntry stores the number of hash functions ('k') used
+and the bitset backing the bloom filter. The original encoding (pre
+ORC-101) of bloom filters used the bitset field encoded as a repeating
+sequence of longs in the bitset field with a little endian encoding
+(0x1 is bit 0 and 0x2 is bit 1.) After ORC-101, the encoding is a
+sequence of bytes with a little endian encoding in the utf8bitset field.
+
+```message BloomFilter {
+ optional uint32 numHashFunctions = 1;
+ repeated fixed64 bitset = 2;
+ optional bytes utf8bitset = 3;
+}
+```
+
+```message BloomFilterIndex {
+ repeated BloomFilter bloomFilter = 1;
+}
+```
+
+Bloom filter internally uses two different hash functions to map a key
+to a position in the bit set. For tinyint, smallint, int, bigint, float
+and double types, Thomas Wang's 64-bit integer hash function is used.
+Floats are converted to IEEE-754 32 bit representation
+(using Java's Float.floatToIntBits(float)). Similary, Doubles are
+converted to IEEE-754 64 bit representation (using Java's
+Double.doubleToLongBits(double)). All these primitive types
+are cast to long base type before being passed on to the hash function.
+For strings and binary types, Murmur3 64 bit hash algorithm is used.
+The 64 bit variant of Murmur3 considers only the most significant
+8 bytes of Murmur3 128-bit algorithm. The 64 bit hashcode generated
+from the above algorithms is used as a base to derive 'k' different
+hash functions. We use the idea mentioned in the paper "Less Hashing,
+Same Performance: Building a Better Bloom Filter" by Kirsch et. al. to
+quickly compute the k hashcodes.
+
+The algorithm for computing k hashcodes and setting the bit position
+in a bloom filter is as follows:
+
+1. Get 64 bit base hash code from Murmur3 or Thomas Wang's hash algorithm.
+2. Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).
+3. k'th hashcode is obtained by (where k > 0):
+  * combinedHash = hash1 + (k * hash2)
+4. If combinedHash is negative flip all the bits:
+  * combinedHash = ~combinedHash
+5. Bit set position is obtained by performing modulo with m:
+  * position = combinedHash % m
+6. Set the position in bit set. The LSB 6 bits identifies the long index
+   within bitset and bit position within the long uses little endian order.
+  * bitset[position >>> 6] \|= (1L << position);
+
+Bloom filter streams are interlaced with row group indexes. This placement
+makes it convenient to read the bloom filter stream and row index stream
+together in single read operation.
+
+![bloom filter]({{ site.url }}/img/BloomFilter.png)

http://git-wip-us.apache.org/repos/asf/orc/blob/24093120/site/specification/ORCv2.md
----------------------------------------------------------------------
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
new file mode 100644
index 0000000..b71b69b
--- /dev/null
+++ b/site/specification/ORCv2.md
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
+   * Zero suppression
+   * Evaluate the rle subformats
+* Group stripe data into stripelets to enable Async IO for reads.
+* Reorder stripe data into (stripe metadata, index, dictionary, data)
+* Stop sorting dictionaries and record the sort order separately in the index.
+* Remove use of RLEv1 and RLEv2.
+* Remove non-utf8 bloom filter.
+* Use numeric value for decimal statistics and bloom filter.
+* Add Zstd with dictionary.
+
+# Motivation
+
+Hive's RCFile was the standard format for storing tabular data in
+Hadoop for several years. However, RCFile has limitations because it
+treats each column as a binary blob without semantics. In Hive 0.11 we
+added a new file format named Optimized Row Columnar (ORC) file that
+uses and retains the type information from the table definition. ORC
+uses type specific readers and writers that provide light weight
+compression techniques such as dictionary encoding, bit packing, delta
+encoding, and run length encoding -- resulting in dramatically smaller
+files. Additionally, ORC can apply generic compression using zlib, or
+Snappy on top of the lightweight compression for even smaller
+files. However, storage savings are only part of the gain. ORC
+supports projection, which selects subsets of the columns for reading,
+so that queries reading only one column read only the required
+bytes. Furthermore, ORC files include light weight indexes that
+include the minimum and maximum values for each column in each set of
+10,000 rows and the entire file. Using pushdown filters from Hive, the
+file reader can skip entire sets of rows that aren't important for
+this query.
+
+![ORC file structure]({{ site.url }}/img/OrcFileLayout.png)
+
+# File Tail
+
+Since HDFS does not support changing the data in a file after it is
+written, ORC stores the top level index at the end of the file. The
+overall structure of the file is given in the figure above.  The
+file's tail consists of 3 parts; the file metadata, file footer and
+postscript.
+
+The metadata for ORC is stored using
+[Protocol Buffers](https://s.apache.org/protobuf_encoding), which provides
+the ability to add new fields without breaking readers. This document
+incorporates the Protobuf definition from the
+[ORC source code](https://s.apache.org/orc_proto) and the
+reader is encouraged to review the Protobuf encoding if they need to
+understand the byte-level encoding
+
+## Postscript
+
+The Postscript section provides the necessary information to interpret
+the rest of the file including the length of the file's Footer and
+Metadata sections, the version of the file, and the kind of general
+compression used (eg. none, zlib, or snappy). The Postscript is never
+compressed and ends one byte before the end of the file. The version
+stored in the Postscript is the lowest version of Hive that is
+guaranteed to be able to read the file and it stored as a sequence of
+the major and minor version. This file version is encoded as [0,12].
+
+The process of reading an ORC file works backwards through the
+file. Rather than making multiple short reads, the ORC reader reads
+the last 16k bytes of the file with the hope that it will contain both
+the Footer and Postscript sections. The final byte of the file
+contains the serialized length of the Postscript, which must be less
+than 256 bytes. Once the Postscript is parsed, the compressed
+serialized length of the Footer is known and it can be decompressed
+and parsed.
+
+```message PostScript {
+ // the length of the footer section in bytes
+ optional uint64 footerLength = 1;
+ // the kind of generic compression used
+ optional CompressionKind compression = 2;
+ // the maximum size of each compression chunk
+ optional uint64 compressionBlockSize = 3;
+ // the version of the writer
+ repeated uint32 version = 4 [packed = true];
+ // the length of the metadata section in bytes
+ optional uint64 metadataLength = 5;
+ // the fixed string "ORC"
+ optional string magic = 8000;
+}
+```
+
+```enum CompressionKind {
+ NONE = 0;
+ ZLIB = 1;
+ SNAPPY = 2;
+ LZO = 3;
+ LZ4 = 4;
+ ZSTD = 5;
+}
+```
+
+## Footer
+
+The Footer section contains the layout of the body of the file, the
+type schema information, the number of rows, and the statistics about
+each of the columns.
+
+The file is broken in to three parts- Header, Body, and Tail. The
+Header consists of the bytes "ORC'' to support tools that want to
+scan the front of the file to determine the type of the file. The Body
+contains the rows and indexes, and the Tail gives the file level
+information as described in this section.
+
+```message Footer {
+ // the length of the file header in bytes (always 3)
+ optional uint64 headerLength = 1;
+ // the length of the file header and body in bytes
+ optional uint64 contentLength = 2;
+ // the information about the stripes
+ repeated StripeInformation stripes = 3;
+ // the schema information
+ repeated Type types = 4;
+ // the user metadata that was added
+ repeated UserMetadataItem metadata = 5;
+ // the total number of rows in the file
+ optional uint64 numberOfRows = 6;
+ // the statistics of each column across the file
+ repeated ColumnStatistics statistics = 7;
+ // the maximum number of rows in each index entry
+ optional uint32 rowIndexStride = 8;
+}
+```
+
+### Stripe Information
+
+The body of the file is divided into stripes. Each stripe is self
+contained and may be read using only its own bytes combined with the
+file's Footer and Postscript. Each stripe contains only entire rows so
+that rows never straddle stripe boundaries. Stripes have three
+sections: a set of indexes for the rows within the stripe, the data
+itself, and a stripe footer. Both the indexes and the data sections
+are divided by columns so that only the data for the required columns
+needs to be read.
+
+```message StripeInformation {
+ // the start of the stripe within the file
+ optional uint64 offset = 1;
+ // the length of the indexes in bytes
+ optional uint64 indexLength = 2;
+ // the length of the data in bytes
+ optional uint64 dataLength = 3;
+ // the length of the footer in bytes
+ optional uint64 footerLength = 4;
+ // the number of rows in the stripe
+ optional uint64 numberOfRows = 5;
+}
+```
+
+### Type Information
+
+All of the rows in an ORC file must have the same schema. Logically
+the schema is expressed as a tree as in the figure below, where
+the compound types have subcolumns under them.
+
+![ORC column structure]({{ site.url }}/img/TreeWriters.png)
+
+The equivalent Hive DDL would be:
+
+```create table Foobar (
+ myInt int,
+ myMap map<string,
+ struct<myString : string,
+ myDouble: double>>,
+ myTime timestamp
+);
+```
+
+The type tree is flattened in to a list via a pre-order traversal
+where each type is assigned the next id. Clearly the root of the type
+tree is always type id 0. Compound types have a field named subtypes
+that contains the list of their children's type ids.
+
+```message Type {
+ enum Kind {
+ BOOLEAN = 0;
+ BYTE = 1;
+ SHORT = 2;
+ INT = 3;
+ LONG = 4;
+ FLOAT = 5;
+ DOUBLE = 6;
+ STRING = 7;
+ BINARY = 8;
+ TIMESTAMP = 9;
+ LIST = 10;
+ MAP = 11;
+ STRUCT = 12;
+ UNION = 13;
+ DECIMAL = 14;
+ DATE = 15;
+ VARCHAR = 16;
+ CHAR = 17;
+ }
+ // the kind of this type
+ required Kind kind = 1;
+ // the type ids of any subcolumns for list, map, struct, or union
+ repeated uint32 subtypes = 2 [packed=true];
+ // the list of field names for struct
+ repeated string fieldNames = 3;
+ // the maximum length of the type for varchar or char in UTF-8 characters
+ optional uint32 maximumLength = 4;
+ // the precision and scale for decimal
+ optional uint32 precision = 5;
+ optional uint32 scale = 6;
+}
+```
+
+### Column Statistics
+
+The goal of the column statistics is that for each column, the writer
+records the count and depending on the type other useful fields. For
+most of the primitive types, it records the minimum and maximum
+values; and for numeric types it additionally stores the sum.
+From Hive 1.1.0 onwards, the column statistics will also record if
+there are any null values within the row group by setting the hasNull flag.
+The hasNull flag is used by ORC's predicate pushdown to better answer
+'IS NULL' queries.
+
+```message ColumnStatistics {
+ // the number of values
+ optional uint64 numberOfValues = 1;
+ // At most one of these has a value for any column
+ optional IntegerStatistics intStatistics = 2;
+ optional DoubleStatistics doubleStatistics = 3;
+ optional StringStatistics stringStatistics = 4;
+ optional BucketStatistics bucketStatistics = 5;
+ optional DecimalStatistics decimalStatistics = 6;
+ optional DateStatistics dateStatistics = 7;
+ optional BinaryStatistics binaryStatistics = 8;
+ optional TimestampStatistics timestampStatistics = 9;
+ optional bool hasNull = 10;
+}
+```
+
+For integer types (tinyint, smallint, int, bigint), the column
+statistics includes the minimum, maximum, and sum. If the sum
+overflows long at any point during the calculation, no sum is
+recorded.
+
+```message IntegerStatistics {
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+ optional sint64 sum = 3;
+}
+```
+
+For floating point types (float, double), the column statistics
+include the minimum, maximum, and sum. If the sum overflows a double,
+no sum is recorded.
+
+```message DoubleStatistics {
+ optional double minimum = 1;
+ optional double maximum = 2;
+ optional double sum = 3;
+}
+```
+
+For strings, the minimum value, maximum value, and the sum of the
+lengths of the values are recorded.
+
+```message StringStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ // sum will store the total length of all strings
+ optional sint64 sum = 3;
+}
+```
+
+For booleans, the statistics include the count of false and true values.
+
+```message BucketStatistics {
+ repeated uint64 count = 1 [packed=true];
+}
+```
+
+For decimals, the minimum, maximum, and sum are stored.
+
+```message DecimalStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ optional string sum = 3;
+}
+```
+
+Date columns record the minimum and maximum values as the number of
+days since the epoch (1/1/2015).
+
+```message DateStatistics {
+ // min,max values saved as days since epoch
+ optional sint32 minimum = 1;
+ optional sint32 maximum = 2;
+}
+```
+
+Timestamp columns record the minimum and maximum values as the number of
+milliseconds since the epoch (1/1/2015).
+
+```message TimestampStatistics {
+ // min,max values saved as milliseconds since epoch
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+}
+```
+
+Binary columns store the aggregate number of bytes across all of the values.
+
+```message BinaryStatistics {
+ // sum will store the total binary blob length
+ optional sint64 sum = 1;
+}
+```
+
+### User Metadata
+
+The user can add arbitrary key/value pairs to an ORC file as it is
+written. The contents of the keys and values are completely
+application defined, but the key is a string and the value is
+binary. Care should be taken by applications to make sure that their
+keys are unique and in general should be prefixed with an organization
+code.
+
+```message UserMetadataItem {
+ // the user defined key
+ required string name = 1;
+ // the user defined binary value
+ required bytes value = 2;
+}
+```
+
+### File Metadata
+
+The file Metadata section contains column statistics at the stripe
+level granularity. These statistics enable input split elimination
+based on the predicate push-down evaluated per a stripe.
+
+```message StripeStatistics {
+ repeated ColumnStatistics colStats = 1;
+}
+```
+
+```message Metadata {
+ repeated StripeStatistics stripeStats = 1;
+}
+```
+
+# Compression
+
+If the ORC file writer selects a generic compression codec (zlib or
+snappy), every part of the ORC file except for the Postscript is
+compressed with that codec. However, one of the requirements for ORC
+is that the reader be able to skip over compressed bytes without
+decompressing the entire stream. To manage this, ORC writes compressed
+streams in chunks with headers as in the figure below.
+To handle uncompressable data, if the compressed data is larger than
+the original, the original is stored and the isOriginal flag is
+set. Each header is 3 bytes long with (compressedLength * 2 +
+isOriginal) stored as a little endian value. For example, the header
+for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
+0x03]. The header for 5 bytes that did not compress would be [0x0b,
+0x00, 0x00]. Each compression chunk is compressed independently so
+that as long as a decompressor starts at the top of a header, it can
+start decompressing without the previous bytes.
+
+![compression streams]({{ site.url }}/img/CompressionStream.png)
+
+The default compression chunk size is 256K, but writers can choose
+their own value. Larger chunks lead to better compression, but require
+more memory. The chunk size is recorded in the Postscript so that
+readers can allocate appropriately sized buffers. Readers are
+guaranteed that no chunk will expand to more than the compression chunk
+size.
+
+ORC files without generic compression write each stream directly
+with no headers.
+
+# Run Length Encoding
+
+## Base 128 Varint
+
+Variable width integer encodings take advantage of the fact that most
+numbers are small and that having smaller encodings for small numbers
+shrinks the overall size of the data. ORC uses the varint format from
+Protocol Buffers, which writes data in little endian format using the
+low 7 bits of each byte. The high bit in each byte is set if the
+number continues into the next byte.
+
+Unsigned Original | Serialized
+:---------------- | :---------
+0                 | 0x00
+1                 | 0x01
+127               | 0x7f
+128               | 0x80, 0x01
+129               | 0x81, 0x01
+16,383            | 0xff, 0x7f
+16,384            | 0x80, 0x80, 0x01
+16,385            | 0x81, 0x80, 0x01
+
+For signed integer types, the number is converted into an unsigned
+number using a zigzag encoding. Zigzag encoding moves the sign bit to
+the least significant bit using the expression (val << 1) ^ (val >>
+63) and derives its name from the fact that positive and negative
+numbers alternate once encoded. The unsigned number is then serialized
+as above.
+
+Signed Original | Unsigned
+:-------------- | :-------
+0               | 0
+-1              | 1
+1               | 2
+-2              | 3
+2               | 4
+
+## Byte Run Length Encoding
+
+For byte streams, ORC uses a very light weight encoding of identical
+values.
+
+* Run - a sequence of at least 3 identical values
+* Literals - a sequence of non-identical values
+
+The first byte of each group of values is a header than determines
+whether it is a run (value between 0 to 127) or literal list (value
+between -128 to -1). For runs, the control byte is the length of the
+run minus the length of the minimal run (3) and the control byte for
+literal lists is the negative length of the list. For example, a
+hundred 0's is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
+would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
+either of the encodings.
+
+## Boolean Run Length Encoding
+
+For encoding boolean types, the bits are put in the bytes from most
+significant to least significant. The bytes are encoded using byte run
+length encoding as described in the previous section. For example,
+the byte sequence [0xff, 0x80] would be one true followed by
+seven false values.
+
+## Integer Run Length Encoding, version 1
+
+In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
+which provides a lightweight compression of signed or unsigned integer
+sequences. RLEv1 has two sub-encodings:
+
+* Run - a sequence of values that differ by a small fixed delta
+* Literals - a sequence of varint encoded values
+
+Runs start with an initial byte of 0x00 to 0x7f, which encodes the
+length of the run - 3. A second byte provides the fixed delta in the
+range of -128 to 127. Finally, the first value of the run is encoded
+as a base 128 varint.
+
+For example, if the sequence is 100 instances of 7 the encoding would
+start with 100 - 3, followed by a delta of 0, and a varint of 7 for
+an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
+running from 100 to 1, the first byte is 100 - 3, the delta is -1,
+and the varint is 100 for an encoding of [0x61, 0xff, 0x64].
+
+Literals start with an initial byte of 0x80 to 0xff, which corresponds
+to the negative of number of literals in the sequence. Following the
+header byte, the list of N varints is encoded. Thus, if there are
+no runs, the overhead is 1 byte for each 128 integers. The first 5
+prime numbers [2, 3, 4, 7, 11] would encoded as [0xfb, 0x02, 0x03,
+0x04, 0x07, 0xb].
+
+## Integer Run Length Encoding, version 2
+
+In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
+which has improved compression and fixed bit width encodings for
+faster expansion. RLEv2 uses four sub-encodings based on the data:
+
+* Short Repeat - used for short sequences with repeated values
+* Direct - used for random sequences with a fixed bit width
+* Patched Base - used for random sequences with a variable bit width
+* Delta - used for monotonically increasing or decreasing sequences
+
+### Short Repeat
+
+The short repeat encoding is used for short repeating integer
+sequences with the goal of minimizing the overhead of the header. All
+of the bits listed in the header are from the first byte to the last
+and from most significant bit to least significant bit. If the type is
+signed, the value is zigzag encoded.
+
+* 1 byte header
+  * 2 bits for encoding type (0)
+  * 3 bits for width (W) of repeating value (1 to 8 bytes)
+  * 3 bits for repeat count (3 to 10 values)
+* W bytes in big endian format, which is zigzag encoded if they type
+  is signed
+
+The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
+serialized with short repeat encoding (0), a width of 2 bytes (1), and
+repeat count of 5 (2) as [0x0a, 0x27, 0x10].
+
+### Direct
+
+The direct encoding is used for integer sequences whose values have a
+relatively constant bit width. It encodes the values directly using a
+fixed width big endian encoding. The width of the values is encoded
+using the table below.
+ 
+The 5 bit width encoding table for RLEv2:
+
+Width in Bits | Encoded Value | Notes
+:------------ | :------------ | :----
+0             | 0             | for delta encoding
+1             | 0             | for non-delta encoding
+2             | 1
+4             | 3
+8             | 7
+16            | 15
+24            | 23
+32            | 27
+40            | 28
+48            | 29
+56            | 30
+64            | 31
+3             | 2             | deprecated
+5 <= x <= 7   | x - 1         | deprecated
+9 <= x <= 15  | x - 1         | deprecated
+17 <= x <= 21 | x - 1         | deprecated
+26            | 24            | deprecated
+28            | 25            | deprecated
+30            | 26            | deprecated
+
+* 2 bytes header
+  * 2 bits for encoding type (1)
+  * 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+    width encoding table
+  * 9 bits for length (L) (1 to 512 values)
+* W * L bits (padded to the next byte) encoded in big endian format, which is
+  zigzag encoding if the type is signed
+
+The unsigned sequence of [23713, 43806, 57005, 48879] would be
+serialized with direct encoding (1), a width of 16 bits (15), and
+length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
+0xbe, 0xef].
+
+### Patched Base
+
+The patched base encoding is used for integer sequences whose bit
+widths varies a lot. The minimum signed value of the sequence is found
+and subtracted from the other values. The bit width of those adjusted
+values is analyzed and the 90 percentile of the bit width is chosen
+as W. The 10\% of values larger than W use patches from a patch list
+to set the additional bits. Patches are encoded as a list of gaps in
+the index values and the additional value bits.
+
+* 4 bytes header
+  * 2 bits for encoding type (2)
+  * 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+      width encoding table
+  * 9 bits for length (L) (1 to 512 values)
+  * 3 bits for base value width (BW) (1 to 8 bytes)
+  * 5 bits for patch width (PW) (1 to 64 bits) using  the 5 bit width
+    encoding table
+  * 3 bits for patch gap width (PGW) (1 to 8 bits)
+  * 5 bits for patch list length (PLL) (0 to 31 patches)
+* Base value (BW bytes) - The base value is stored as a big endian value
+  with negative values marked by the most significant bit set. If it that
+  bit is set, the entire value is negated.
+* Data values (W * L bits padded to the byte) - A sequence of W bit positive
+  values that are added to the base value.
+* Data values (W * L bits padded to the byte) - A sequence of W bit positive
+  values that are added to the base value.
+* Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
+  that didn't fit within W bits. Each entry in the list consists of a
+  gap, which is the number of elements skipped from the previous
+  patch, and a patch value. Patches are applied by logically or'ing
+  the data values with the relevant patch shifted W bits left. If a
+  patch is 0, it was introduced to skip over more than 255 items. The
+  combined length of each patch (PGW + PW) must be less or equal to
+  64.
+
+The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
+2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
+has a minimum of 2000, which makes the adjusted
+sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
+150, 160, 170, 180, 190]. It has an
+encoding of patched base (2), a bit width of 8 (7), a length of 20
+(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
+patch gap width of 2 bits (1), and a patch list length of 1 (1). The
+base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
+0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
+0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8]
+
+### Delta
+
+The Delta encoding is used for monotonically increasing or decreasing
+sequences. The first two numbers in the sequence can not be identical,
+because the encoding is using the sign of the first delta to determine
+if the series is increasing or decreasing.
+
+* 2 bytes header
+  * 2 bits for encoding type (3)
+  * 5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
+    width encoding table
+  * 9 bits for run length (L) (1 to 512 values)
+* Base value - encoded as (signed or unsigned) varint
+* Delta base - encoded as signed varint
+* Delta values $W * (L - 2)$ bytes - encode each delta after the first
+  one. If the delta base is positive, the sequence is increasing and if it is
+  negative the sequence is decreasing.
+
+The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
+serialized with delta encoding (3), a width of 4 bits (3), length of
+10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
+sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].
+
+# Stripes
+
+The body of ORC files consists of a series of stripes. Stripes are
+large (typically ~200MB) and independent of each other and are often
+processed by different tasks. The defining characteristic for columnar
+storage formats is that the data for each column is stored separately
+and that reading data out of the file should be proportional to the
+number of columns read.
+
+In ORC files, each column is stored in several streams that are stored
+next to each other in the file. For example, an integer column is
+represented as two streams PRESENT, which uses one with a bit per
+value recording if the value is non-null, and DATA, which records the
+non-null values. If all of a column's values in a stripe are non-null,
+the PRESENT stream is omitted from the stripe. For binary data, ORC
+uses three streams PRESENT, DATA, and LENGTH, which stores the length
+of each value. The details of each type will be presented in the
+following subsections.
+
+## Stripe Footer
+
+The stripe footer contains the encoding of each column and the
+directory of the streams including their location.
+
+```message StripeFooter {
+ // the location of each stream
+ repeated Stream streams = 1;
+ // the encoding of each column
+ repeated ColumnEncoding columns = 2;
+}
+```
+
+To describe each stream, ORC stores the kind of stream, the column id,
+and the stream's size in bytes. The details of what is stored in each stream
+depends on the type and encoding of the column.
+
+```message Stream {
+ enum Kind {
+ // boolean stream of whether the next value is non-null
+ PRESENT = 0;
+ // the primary data stream
+ DATA = 1;
+ // the length of each value for variable length data
+ LENGTH = 2;
+ // the dictionary blob
+ DICTIONARY\_DATA = 3;
+ // deprecated prior to Hive 0.11
+ // It was used to store the number of instances of each value in the
+ // dictionary
+ DICTIONARY_COUNT = 4;
+ // a secondary data stream
+ SECONDARY = 5;
+ // the index for seeking to particular row groups
+ ROW_INDEX = 6;
+ // original bloom filters used before ORC-101
+ BLOOM_FILTER = 7;
+ // bloom filters that consistently use utf8
+ BLOOM_FILTER_UTF8 = 8;
+ }
+ required Kind kind = 1;
+ // the column id
+ optional uint32 column = 2;
+ // the number of bytes in the file
+ optional uint64 length = 3;
+}
+```
+
+Depending on their type several options for encoding are possible. The
+encodings are divided into direct or dictionary-based categories and
+further refined as to whether they use RLE v1 or v2.
+
+```message ColumnEncoding {
+ enum Kind {
+ // the encoding is mapped directly to the stream using RLE v1
+ DIRECT = 0;
+ // the encoding uses a dictionary of unique values using RLE v1
+ DICTIONARY = 1;
+ // the encoding is direct using RLE v2
+ DIRECT\_V2 = 2;
+ // the encoding is dictionary-based using RLE v2
+ DICTIONARY\_V2 = 3;
+ }
+ required Kind kind = 1;
+ // for dictionary encodings, record the size of the dictionary
+ optional uint32 dictionarySize = 2;
+}
+```
+
+# Column Encodings
+
+## SmallInt, Int, and BigInt Columns
+
+All of the 16, 32, and 64 bit integer column types use the same set of
+potential encodings, which is basically whether they use RLE v1 or
+v2. If the PRESENT stream is not included, all of the values are
+present. For values that have false bits in the present stream, no
+values are included in the data stream.
+
+Encoding  | Stream Kind | Optional | Contents
+:-------- | :---------- | :------- | :-------
+DIRECT    | PRESENT     | Yes      | Boolean RLE
+          | DATA        | No       | Signed Integer RLE v1
+DIRECT_V2 | PRESENT     | Yes      | Boolean RLE
+          | DATA        | No       | Signed Integer RLE v2
+
+## Float and Double Columns
+
+Floating point types are stored using IEEE 754 floating point bit
+layout. Float columns use 4 bytes per value and double columns use 8
+bytes.
+
+Encoding  | Stream Kind | Optional | Contents
+:-------- | :---------- | :------- | :-------
+DIRECT    | PRESENT     | Yes      | Boolean RLE
+          | DATA        | No       | IEEE 754 floating point representation
+
+## String, Char, and VarChar Columns
+
+String, char, and varchar columns may be encoded either using a
+dictionary encoding or a direct encoding. A direct encoding should be
+preferred when there are many distinct values. In all of the
+encodings, the PRESENT stream encodes whether the value is null. The
+Java ORC writer automatically picks the encoding after the first row
+group (10,000 rows).
+
+For direct encoding the UTF-8 bytes are saved in the DATA stream and
+the length of each value is written into the LENGTH stream. In direct
+encoding, if the values were ["Nevada", "California"]; the DATA
+would be "NevadaCalifornia" and the LENGTH would be [6, 10].
+
+For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+each unique value are placed into DICTIONARY_DATA. The length of each
+item in the dictionary is put into the LENGTH stream. The DATA stream
+consists of the sequence of references to the dictionary elements.
+
+In dictionary encoding, if the values were ["Nevada",
+"California", "Nevada", "California", and "Florida"]; the
+DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would
+be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DICTIONARY    | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unsigned Integer RLE v1
+              | DICTIONARY_DATA | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v2
+DICTIONARY_V2 | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unsigned Integer RLE v2
+              | DICTIONARY_DATA | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Boolean Columns
+
+Boolean columns are rare, but have a simple encoding.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Boolean RLE
+
+## TinyInt Columns
+
+TinyInt (byte) columns use byte run length encoding.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Byte RLE
+
+## Binary Columns
+
+Binary data is encoded with a PRESENT stream, a DATA stream that records
+the contents, and a LENGTH stream that records the number of bytes per a
+value.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | String contents
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Decimal Columns
+
+Decimal was introduced in Hive 0.11 with infinite precision (the total
+number of digits). In Hive 0.13, the definition was change to limit
+the precision to a maximum of 38 digits, which conveniently uses 127
+bits plus a sign bit. The current encoding of decimal columns stores
+the integer representation of the value as an unbounded length zigzag
+encoded base 128 varint. The scale is stored in the SECONDARY stream
+as an signed integer.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unbounded base 128 varints
+              | SECONDARY       | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Unbounded base 128 varints
+              | SECONDARY       | No       | Unsigned Integer RLE v2
+
+## Date Columns
+
+Date data is encoded with a PRESENT stream, a DATA stream that records
+the number of days after January 1, 1970 in UTC.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v2
+
+## Timestamp Columns
+
+Timestamp records times down to nanoseconds as a PRESENT stream that
+records non-null values, a DATA stream that records the number of
+seconds after 1 January 2015, and a SECONDARY stream that records the
+number of nanoseconds.
+
+Because the number of nanoseconds often has a large number of trailing
+zeros, the number has trailing decimal zero digits removed and the
+last three bits are used to record how many zeros were removed. Thus
+1000 nanoseconds would be serialized as 0x0b and 100000 would be
+serialized as 0x0d.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v1
+              | SECONDARY       | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Signed Integer RLE v2
+              | SECONDARY       | No       | Unsigned Integer RLE v2
+
+## Struct Columns
+
+Structs have no data themselves and delegate everything to their child
+columns except for their PRESENT stream. They have a child column
+for each of the fields.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+
+## List Columns
+
+Lists are encoded as the PRESENT stream and a length stream with
+number of items in each list. They have a single child column for the
+element values.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Map Columns
+
+Maps are encoded as the PRESENT stream and a length stream with number
+of items in each list. They have a child column for the key and
+another child column for the value.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
+## Union Columns
+
+Unions are encoded as the PRESENT stream and a tag stream that controls which
+potential variant is used. They have a child column for each variant of the
+union. Currently ORC union types are limited to 256 variants, which matches
+the Hive type model.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DIRECT          | No       | Byte RLE
+
+# Indexes
+
+## Row Group Index
+
+The row group indexes consist of a ROW_INDEX stream for each primitive
+column that has an entry for each row group. Row groups are controlled
+by the writer and default to 10,000 rows. Each RowIndexEntry gives the
+position of each stream for the column and the statistics for that row
+group.
+
+The index streams are placed at the front of the stripe, because in
+the default case of streaming they do not need to be read. They are
+only loaded when either predicate push down is being used or the
+reader seeks to a particular row.
+
+```message RowIndexEntry {
+ repeated uint64 positions = 1 [packed=true];
+ optional ColumnStatistics statistics = 2;
+}
+```
+
+```message RowIndex {
+ repeated RowIndexEntry entry = 1;
+}
+```
+
+To record positions, each stream needs a sequence of numbers. For
+uncompressed streams, the position is the byte offset of the RLE run's
+start location followed by the number of values that need to be
+consumed from the run. In compressed streams, the first number is the
+start of the compression chunk in the stream, followed by the number
+of decompressed bytes that need to be consumed, and finally the number
+of values consumed in the RLE.
+
+For columns with multiple streams, the sequences of positions in each
+stream are concatenated. That was an unfortunate decision on my part
+that we should fix at some point, because it makes code that uses the
+indexes error-prone.
+
+Because dictionaries are accessed randomly, there is not a position to
+record for the dictionary and the entire dictionary must be read even
+if only part of a stripe is being read.
+
+## Bloom Filter Index
+
+Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
+Predicate pushdown can make use of bloom filters to better prune
+the row groups that do not satisfy the filter condition.
+The bloom filter indexes consist of a BLOOM_FILTER stream for each
+column specified through 'orc.bloom.filter.columns' table properties.
+A BLOOM_FILTER stream records a bloom filter entry for each row
+group (default to 10,000 rows) in a column. Only the row groups that
+satisfy min/max row index evaluation will be evaluated against the
+bloom filter index.
+
+Each BloomFilterEntry stores the number of hash functions ('k') used
+and the bitset backing the bloom filter. The original encoding (pre
+ORC-101) of bloom filters used the bitset field encoded as a repeating
+sequence of longs in the bitset field with a little endian encoding
+(0x1 is bit 0 and 0x2 is bit 1.) After ORC-101, the encoding is a
+sequence of bytes with a little endian encoding in the utf8bitset field.
+
+```message BloomFilter {
+ optional uint32 numHashFunctions = 1;
+ repeated fixed64 bitset = 2;
+ optional bytes utf8bitset = 3;
+}
+```
+
+```message BloomFilterIndex {
+ repeated BloomFilter bloomFilter = 1;
+}
+```
+
+Bloom filter internally uses two different hash functions to map a key
+to a position in the bit set. For tinyint, smallint, int, bigint, float
+and double types, Thomas Wang's 64-bit integer hash function is used.
+Floats are converted to IEEE-754 32 bit representation
+(using Java's Float.floatToIntBits(float)). Similary, Doubles are
+converted to IEEE-754 64 bit representation (using Java's
+Double.doubleToLongBits(double)). All these primitive types
+are cast to long base type before being passed on to the hash function.
+For strings and binary types, Murmur3 64 bit hash algorithm is used.
+The 64 bit variant of Murmur3 considers only the most significant
+8 bytes of Murmur3 128-bit algorithm. The 64 bit hashcode generated
+from the above algorithms is used as a base to derive 'k' different
+hash functions. We use the idea mentioned in the paper "Less Hashing,
+Same Performance: Building a Better Bloom Filter" by Kirsch et. al. to
+quickly compute the k hashcodes.
+
+The algorithm for computing k hashcodes and setting the bit position
+in a bloom filter is as follows:
+
+1. Get 64 bit base hash code from Murmur3 or Thomas Wang's hash algorithm.
+2. Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).
+3. k'th hashcode is obtained by (where k > 0):
+  * combinedHash = hash1 + (k * hash2)
+4. If combinedHash is negative flip all the bits:
+  * combinedHash = ~combinedHash
+5. Bit set position is obtained by performing modulo with m:
+  * position = combinedHash % m
+6. Set the position in bit set. The LSB 6 bits identifies the long index
+   within bitset and bit position within the long uses little endian order.
+  * bitset[position >>> 6] \|= (1L << position);
+
+Bloom filter streams are interlaced with row group indexes. This placement
+makes it convenient to read the bloom filter stream and row index stream
+together in single read operation.
+
+![bloom filter]({{ site.url }}/img/BloomFilter.png)

http://git-wip-us.apache.org/repos/asf/orc/blob/24093120/site/specification/index.md
----------------------------------------------------------------------
diff --git a/site/specification/index.md b/site/specification/index.md
new file mode 100644
index 0000000..29782d4
--- /dev/null
+++ b/site/specification/index.md
@@ -0,0 +1,18 @@
+---
+layout: page
+title: ORC Specification
+---
+
+There have been two released ORC file versions:
+
+* [ORC v0](ORCv0.html) was released in Hive 0.11.
+* [ORC v1](ORCv1.html) was released in Hive 0.12 and ORC 1.x.
+
+Each version of the library will detect the format version and use
+the appropriate reader. The library can also write the older versions
+of the file format to ensure that users can write files that all of their
+clusters can read correctly.
+
+We are working on a new version of the file format:
+
+* [ORC v2](ORCv2.html) is a work in progress and is rapidly evolving.

[1/2] orc git commit: ORC-339. Reorganize the ORC file format specification.

Reply via email to