http://git-wip-us.apache.org/repos/asf/orc/blob/6a400548/docs/hive-ddl.html ---------------------------------------------------------------------- diff --git a/docs/hive-ddl.html b/docs/hive-ddl.html new file mode 100644 index 0000000..310bbfb --- /dev/null +++ b/docs/hive-ddl.html @@ -0,0 +1,998 @@ +<!DOCTYPE HTML> +<html lang="en-US"> +<head> + <meta charset="UTF-8"> + <title>Hive DDL</title> + <meta name="viewport" content="width=device-width,initial-scale=1"> + <meta name="generator" content="Jekyll v2.4.0"> + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + <link rel="stylesheet" href="/css/screen.css"> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + <!--[if lt IE 9]> + <script src="/js/html5shiv.min.js"></script> + <script src="/js/respond.min.js"></script> + <![endif]--> +</head> + + +<body class="wrap"> + <header role="banner"> + <nav class="mobile-nav show-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + <div class="grid"> + <div class="unit one-third center-on-mobiles"> + <h1> + <a href="/"> + <span class="sr-only">Apache ORC</span> + <img src="/img/logo.png" width="249" height="115" alt="ORC Logo"> + </a> + </h1> + </div> + <nav class="main-nav unit two-thirds hide-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + </div> +</header> + + + 
<section class="docs"> + <div class="grid"> + + <div class="docs-nav-mobile unit whole show-on-mobiles"> + <select onchange="if (this.value) window.location.href=this.value"> + <option value="">Navigate the docs…</option> + + <optgroup label="Overview"> + + + + + + + + + + + + + + + + + + + + <option value="/docs/index.html">Background</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/types.html">Types</option> + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/indexes.html">Indexes</option> + + + + + + + + + + + + + + + + + + <option value="/docs/acid.html">ACID support</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Hive Usage"> + + + + + + + + + + + + + + + + + + <option value="/docs/hive-ddl.html">Hive DDL</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/hive-config.html">Hive Configuration</option> + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Format Specification"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-intro.html">Introduction</option> + + + + + + + + + + + + + + + + + + <option value="/docs/file-tail.html">File Tail</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/compression.html">Compression</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/run-length.html">Run Length Encoding</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/stripes.html">Stripes</option> + + + + + + + + + + + + + + <option value="/docs/encodings.html">Column Encodings</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-index.html">Indexes</option> + + + + + + + + + + + </optgroup> + + </select> +</div> + + + 
<div class="unit four-fifths"> + <article> + <h1>Hive DDL</h1> + <p>ORC is well integrated into Hive, so storing your istari table as ORC +only requires adding “STORED AS ORC” to the table definition.</p> + +<p><code>CREATE TABLE istari ( + name STRING, + color STRING +) STORED AS ORC; +</code></p> + +<p>To modify a table so that new partitions of the istari table are +stored as ORC files:</p> + +<p><code>ALTER TABLE istari SET FILEFORMAT ORC; +</code></p> + +<p>As of Hive 0.14, users can request an efficient merge of small ORC files +by issuing a CONCATENATE command on their table or partition. The +files will be merged at the stripe level without reserialization.</p> + +<p><code>ALTER TABLE istari [PARTITION partition_spec] CONCATENATE; +</code></p> + +<p>To get information about an ORC file, use the orcfiledump command.</p> + +<p><code>% hive --orcfiledump &lt;path_to_file&gt; +</code></p> + +<p>As of Hive 1.1, to display the data in the ORC file, use:</p> + +<p><code>% hive --orcfiledump -d &lt;path_to_file&gt; +</code></p> + + + + + + + + + + + + + + + + + + + + + + <div class="section-nav"> + <div class="left align-right"> + + + + <a href="/docs/acid.html" class="prev">Back</a> + + </div> + <div class="right align-left"> + + + + <a href="/docs/hive-config.html" class="next">Next</a> + + </div> + </div> + <div class="clear"></div> + + + </article> + </div> + + <div class="unit one-fifth hide-on-mobiles"> + <aside> + + <h4>Overview</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/index.html">Background</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/types.html">Types</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/indexes.html">Indexes</a></li> + + + + + + + + + + + + <li class=""><a href="/docs/acid.html">ACID support</a></li> + + + +</ul> + + + <h4>Hive Usage</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + <li class="current"><a 
href="/docs/hive-ddl.html">Hive DDL</a></li> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li> + + + +</ul> + + + <h4>Format Specification</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-intro.html">Introduction</a></li> + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/file-tail.html">File Tail</a></li> + + + + + + + + + + + + + + <li class=""><a href="/docs/compression.html">Compression</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/stripes.html">Stripes</a></li> + + + + + + + + + + + + + + + + <li class=""><a href="/docs/encodings.html">Column Encodings</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-index.html">Indexes</a></li> + + + +</ul> + + + </aside> +</div> + + + <div class="clear"></div> + + </div> + </section> + + + <footer role="contentinfo"> + <p>The contents of this website are © 2015 + <a href="https://www.apache.org/">Apache Software Foundation</a> + under the terms of the <a + href="https://www.apache.org/licenses/LICENSE-2.0.html"> + Apache License v2</a>. 
Apache ORC and its logo are trademarks + of the Apache Software Foundation.</p> +</footer> + + <script> + var anchorForId = function (id) { + var anchor = document.createElement("a"); + anchor.className = "header-link"; + anchor.href = "#" + id; + anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>"; + anchor.title = "Permalink"; + return anchor; + }; + + var linkifyAnchors = function (level, containingElement) { + var headers = containingElement.getElementsByTagName("h" + level); + for (var h = 0; h < headers.length; h++) { + var header = headers[h]; + + if (typeof header.id !== "undefined" && header.id !== "") { + header.appendChild(anchorForId(header.id)); + } + } + }; + + document.onreadystatechange = function () { + if (this.readyState === "complete") { + var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0]; + if (!contentBlock) { + return; + } + for (var level = 1; level <= 6; level++) { + linkifyAnchors(level, contentBlock); + } + } + }; +</script> + + +</body> +</html>
http://git-wip-us.apache.org/repos/asf/orc/blob/6a400548/docs/index.html ---------------------------------------------------------------------- diff --git a/docs/index.html b/docs/index.html new file mode 100644 index 0000000..8eae9af --- /dev/null +++ b/docs/index.html @@ -0,0 +1,984 @@ +<!DOCTYPE HTML> +<html lang="en-US"> +<head> + <meta charset="UTF-8"> + <title>Background</title> + <meta name="viewport" content="width=device-width,initial-scale=1"> + <meta name="generator" content="Jekyll v2.4.0"> + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + <link rel="stylesheet" href="/css/screen.css"> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + <!--[if lt IE 9]> + <script src="/js/html5shiv.min.js"></script> + <script src="/js/respond.min.js"></script> + <![endif]--> +</head> + + +<body class="wrap"> + <header role="banner"> + <nav class="mobile-nav show-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + <div class="grid"> + <div class="unit one-third center-on-mobiles"> + <h1> + <a href="/"> + <span class="sr-only">Apache ORC</span> + <img src="/img/logo.png" width="249" height="115" alt="ORC Logo"> + </a> + </h1> + </div> + <nav class="main-nav unit two-thirds hide-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + </div> +</header> + + + <section 
class="docs"> + <div class="grid"> + + <div class="docs-nav-mobile unit whole show-on-mobiles"> + <select onchange="if (this.value) window.location.href=this.value"> + <option value="">Navigate the docs…</option> + + <optgroup label="Overview"> + + + + + + + + + + + + + + + + + + + + <option value="/docs/index.html">Background</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/types.html">Types</option> + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/indexes.html">Indexes</option> + + + + + + + + + + + + + + + + + + <option value="/docs/acid.html">ACID support</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Hive Usage"> + + + + + + + + + + + + + + + + + + <option value="/docs/hive-ddl.html">Hive DDL</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/hive-config.html">Hive Configuration</option> + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Format Specification"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-intro.html">Introduction</option> + + + + + + + + + + + + + + + + + + <option value="/docs/file-tail.html">File Tail</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/compression.html">Compression</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/run-length.html">Run Length Encoding</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/stripes.html">Stripes</option> + + + + + + + + + + + + + + <option value="/docs/encodings.html">Column Encodings</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-index.html">Indexes</option> + + + + + + + + + + + </optgroup> + + </select> +</div> + + + <div 
class="unit four-fifths"> + <article> + <h1>Background</h1> + <p>Back in January 2013, we created ORC files as part of the initiative +to massively speed up Apache Hive and improve the storage efficiency +of data stored in Apache Hadoop. The focus was on enabling high speed +processing and reducing file sizes.</p> + +<p>ORC is a self-describing type-aware columnar file format designed for +Hadoop workloads. It is optimized for large streaming reads, but with +integrated support for finding required rows quickly. Storing data in +a columnar format lets the reader read, decompress, and process only +the values that are required for the current query. Because ORC files +are type-aware, the writer chooses the most appropriate encoding for +the type and builds an internal index as the file is written.</p> + +<p>Predicate pushdown uses those indexes to determine which stripes in a +file need to be read for a particular query and the row indexes can +narrow the search to a particular set of 10,000 rows. ORC supports the +complete set of types in Hive, including the complex types: structs, +lists, maps, and unions.</p> + +<p>Many large Hadoop users have adopted ORC. For instance, Facebook uses +ORC to <a href="http://s.apache.org/fb-scaling-300-pb">save tens of petabytes</a> +in their data warehouse and demonstrated that ORC is <a href="http://s.apache.org/presto-orc">significantly +faster</a> than RC File or Parquet. Yahoo +uses ORC to store their production data and has released some of their +<a href="http://s.apache.org/yahoo-orc">benchmark results</a>.</p> + +<p>ORC files are divided into <em>stripes</em> that are roughly 64MB by +default. The stripes in a file are independent of each other and form +the natural unit of distributed work. 
Within each stripe, the columns +are separated from each other so the reader can read just the columns +that are required.</p> + + + + + + + + + + <div class="section-nav"> + <div class="left align-right"> + + <span class="prev disabled">Back</span> + + </div> + <div class="right align-left"> + + + + <a href="/docs/types.html" class="next">Next</a> + + </div> + </div> + <div class="clear"></div> + + + </article> + </div> + + <div class="unit one-fifth hide-on-mobiles"> + <aside> + + <h4>Overview</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + <li class="current"><a href="/docs/index.html">Background</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/types.html">Types</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/indexes.html">Indexes</a></li> + + + + + + + + + + + + <li class=""><a href="/docs/acid.html">ACID support</a></li> + + + +</ul> + + + <h4>Hive Usage</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li> + + + +</ul> + + + <h4>Format Specification</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-intro.html">Introduction</a></li> + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/file-tail.html">File Tail</a></li> + + + + + + + + + + + + + + <li class=""><a href="/docs/compression.html">Compression</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/stripes.html">Stripes</a></li> + + + + + + + + + + + + + + + + <li class=""><a href="/docs/encodings.html">Column Encodings</a></li> + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + <li class=""><a href="/docs/spec-index.html">Indexes</a></li> + + + +</ul> + + + </aside> +</div> + + + <div class="clear"></div> + + </div> + </section> + + + <footer role="contentinfo"> + <p>The contents of this website are © 2015 + <a href="https://www.apache.org/">Apache Software Foundation</a> + under the terms of the <a + href="https://www.apache.org/licenses/LICENSE-2.0.html"> + Apache License v2</a>. Apache ORC and its logo are trademarks + of the Apache Software Foundation.</p> +</footer> + + <script> + var anchorForId = function (id) { + var anchor = document.createElement("a"); + anchor.className = "header-link"; + anchor.href = "#" + id; + anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>"; + anchor.title = "Permalink"; + return anchor; + }; + + var linkifyAnchors = function (level, containingElement) { + var headers = containingElement.getElementsByTagName("h" + level); + for (var h = 0; h < headers.length; h++) { + var header = headers[h]; + + if (typeof header.id !== "undefined" && header.id !== "") { + header.appendChild(anchorForId(header.id)); + } + } + }; + + document.onreadystatechange = function () { + if (this.readyState === "complete") { + var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0]; + if (!contentBlock) { + return; + } + for (var level = 1; level <= 6; level++) { + linkifyAnchors(level, contentBlock); + } + } + }; +</script> + + +</body> +</html> http://git-wip-us.apache.org/repos/asf/orc/blob/6a400548/docs/indexes.html ---------------------------------------------------------------------- diff --git a/docs/indexes.html b/docs/indexes.html new file mode 100644 index 0000000..9ae634a --- /dev/null +++ b/docs/indexes.html @@ -0,0 +1,989 @@ +<!DOCTYPE HTML> +<html lang="en-US"> +<head> + <meta charset="UTF-8"> + <title>Indexes</title> + <meta name="viewport" content="width=device-width,initial-scale=1"> + <meta name="generator" 
content="Jekyll v2.4.0"> + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + <link rel="stylesheet" href="/css/screen.css"> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + <!--[if lt IE 9]> + <script src="/js/html5shiv.min.js"></script> + <script src="/js/respond.min.js"></script> + <![endif]--> +</head> + + +<body class="wrap"> + <header role="banner"> + <nav class="mobile-nav show-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + <div class="grid"> + <div class="unit one-third center-on-mobiles"> + <h1> + <a href="/"> + <span class="sr-only">Apache ORC</span> + <img src="/img/logo.png" width="249" height="115" alt="ORC Logo"> + </a> + </h1> + </div> + <nav class="main-nav unit two-thirds hide-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + </div> +</header> + + + <section class="docs"> + <div class="grid"> + + <div class="docs-nav-mobile unit whole show-on-mobiles"> + <select onchange="if (this.value) window.location.href=this.value"> + <option value="">Navigate the docs…</option> + + <optgroup label="Overview"> + + + + + + + + + + + + + + + + + + + + <option value="/docs/index.html">Background</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/types.html">Types</option> + + + + + + + 
+ + + + + + + + + + + + + + + <option value="/docs/indexes.html">Indexes</option> + + + + + + + + + + + + + + + + + + <option value="/docs/acid.html">ACID support</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Hive Usage"> + + + + + + + + + + + + + + + + + + <option value="/docs/hive-ddl.html">Hive DDL</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/hive-config.html">Hive Configuration</option> + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Format Specification"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-intro.html">Introduction</option> + + + + + + + + + + + + + + + + + + <option value="/docs/file-tail.html">File Tail</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/compression.html">Compression</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/run-length.html">Run Length Encoding</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/stripes.html">Stripes</option> + + + + + + + + + + + + + + <option value="/docs/encodings.html">Column Encodings</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-index.html">Indexes</option> + + + + + + + + + + + </optgroup> + + </select> +</div> + + + <div class="unit four-fifths"> + <article> + <h1>Indexes</h1> + <p>ORC provides three levels of indexes within each file:</p> + +<ul> + <li>file level - statistics about the values in each column across the entire +file</li> + <li>stripe level - statistics about the values in each column for each stripe</li> + <li>row level - statistics about the values in each column for each set of +10,000 rows within a stripe</li> +</ul> + +<p>The file and stripe level column statistics are in the file 
footer so +that they are easy to access to determine if the rest of the file +needs to be read at all. Row level indexes include both the column +statistics for each row group and the position for seeking to the +start of the row group.</p> + +<p>Column statistics always contain the count of values and whether there +are null values present. Most other primitive types include the +minimum and maximum values and for numeric types the sum. As of Hive +1.2, the indexes can include bloom filters, which provide a much more +selective filter.</p> + +<p>The indexes at all levels are used by the reader using Search +ARGuments or SARGs, which are simplified expressions that restrict the +rows that are of interest. For example, if a query were looking for +people older than 100 years, the SARG would be “age > 100” and +only files, stripes, or row groups that had people over 100 years old +would be read.</p> + + + + + + + + + + + + + + + + <div class="section-nav"> + <div class="left align-right"> + + + + <a href="/docs/types.html" class="prev">Back</a> + + </div> + <div class="right align-left"> + + + + <a href="/docs/acid.html" class="next">Next</a> + + </div> + </div> + <div class="clear"></div> + + + </article> + </div> + + <div class="unit one-fifth hide-on-mobiles"> + <aside> + + <h4>Overview</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/index.html">Background</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/types.html">Types</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + <li class="current"><a href="/docs/indexes.html">Indexes</a></li> + + + + + + + + + + + + <li class=""><a href="/docs/acid.html">ACID support</a></li> + + + +</ul> + + + <h4>Hive Usage</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li> + + + + + + + + + + + + + + + + + + + + <li class=""><a 
href="/docs/hive-config.html">Hive Configuration</a></li> + + + +</ul> + + + <h4>Format Specification</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-intro.html">Introduction</a></li> + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/file-tail.html">File Tail</a></li> + + + + + + + + + + + + + + <li class=""><a href="/docs/compression.html">Compression</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/stripes.html">Stripes</a></li> + + + + + + + + + + + + + + + + <li class=""><a href="/docs/encodings.html">Column Encodings</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-index.html">Indexes</a></li> + + + +</ul> + + + </aside> +</div> + + + <div class="clear"></div> + + </div> + </section> + + + <footer role="contentinfo"> + <p>The contents of this website are © 2015 + <a href="https://www.apache.org/">Apache Software Foundation</a> + under the terms of the <a + href="https://www.apache.org/licenses/LICENSE-2.0.html"> + Apache License v2</a>. 
Apache ORC and its logo are trademarks + of the Apache Software Foundation.</p> +</footer> + + <script> + var anchorForId = function (id) { + var anchor = document.createElement("a"); + anchor.className = "header-link"; + anchor.href = "#" + id; + anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>"; + anchor.title = "Permalink"; + return anchor; + }; + + var linkifyAnchors = function (level, containingElement) { + var headers = containingElement.getElementsByTagName("h" + level); + for (var h = 0; h < headers.length; h++) { + var header = headers[h]; + + if (typeof header.id !== "undefined" && header.id !== "") { + header.appendChild(anchorForId(header.id)); + } + } + }; + + document.onreadystatechange = function () { + if (this.readyState === "complete") { + var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0]; + if (!contentBlock) { + return; + } + for (var level = 1; level <= 6; level++) { + linkifyAnchors(level, contentBlock); + } + } + }; +</script> + + +</body> +</html> http://git-wip-us.apache.org/repos/asf/orc/blob/6a400548/docs/run-length.html ---------------------------------------------------------------------- diff --git a/docs/run-length.html b/docs/run-length.html new file mode 100644 index 0000000..4821888 --- /dev/null +++ b/docs/run-length.html @@ -0,0 +1,1377 @@ +<!DOCTYPE HTML> +<html lang="en-US"> +<head> + <meta charset="UTF-8"> + <title>Run Length Encoding</title> + <meta name="viewport" content="width=device-width,initial-scale=1"> + <meta name="generator" content="Jekyll v2.4.0"> + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + <link rel="stylesheet" href="/css/screen.css"> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + <!--[if lt IE 9]> + <script src="/js/html5shiv.min.js"></script> + <script src="/js/respond.min.js"></script> + <![endif]--> +</head> + + +<body 
class="wrap"> + <header role="banner"> + <nav class="mobile-nav show-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + <div class="grid"> + <div class="unit one-third center-on-mobiles"> + <h1> + <a href="/"> + <span class="sr-only">Apache ORC</span> + <img src="/img/logo.png" width="249" height="115" alt="ORC Logo"> + </a> + </h1> + </div> + <nav class="main-nav unit two-thirds hide-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + </div> +</header> + + + <section class="docs"> + <div class="grid"> + + <div class="docs-nav-mobile unit whole show-on-mobiles"> + <select onchange="if (this.value) window.location.href=this.value"> + <option value="">Navigate the docs…</option> + + <optgroup label="Overview"> + + + + + + + + + + + + + + + + + + + + <option value="/docs/index.html">Background</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/types.html">Types</option> + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/indexes.html">Indexes</option> + + + + + + + + + + + + + + + + + + <option value="/docs/acid.html">ACID support</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Hive Usage"> + + + + + + + + + + + + + + + + + + <option value="/docs/hive-ddl.html">Hive DDL</option> + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + <option value="/docs/hive-config.html">Hive Configuration</option> + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Format Specification"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-intro.html">Introduction</option> + + + + + + + + + + + + + + + + + + <option value="/docs/file-tail.html">File Tail</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/compression.html">Compression</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/run-length.html">Run Length Encoding</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/stripes.html">Stripes</option> + + + + + + + + + + + + + + <option value="/docs/encodings.html">Column Encodings</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-index.html">Indexes</option> + + + + + + + + + + + </optgroup> + + </select> +</div> + + + <div class="unit four-fifths"> + <article> + <h1>Run Length Encoding</h1> + <h1 id="base-128-varint">Base 128 Varint</h1> + +<p>Variable width integer encodings take advantage of the fact that most +numbers are small and that having smaller encodings for small numbers +shrinks the overall size of the data. ORC uses the varint format from +Protocol Buffers, which writes data in little endian format using the +low 7 bits of each byte. 
The high bit in each byte is set if the +number continues into the next byte.</p> + +<table> + <thead> + <tr> + <th style="text-align: left">Unsigned Original</th> + <th style="text-align: left">Serialized</th> + </tr> + </thead> + <tbody> + <tr> + <td style="text-align: left">0</td> + <td style="text-align: left">0x00</td> + </tr> + <tr> + <td style="text-align: left">1</td> + <td style="text-align: left">0x01</td> + </tr> + <tr> + <td style="text-align: left">127</td> + <td style="text-align: left">0x7f</td> + </tr> + <tr> + <td style="text-align: left">128</td> + <td style="text-align: left">0x80, 0x01</td> + </tr> + <tr> + <td style="text-align: left">129</td> + <td style="text-align: left">0x81, 0x01</td> + </tr> + <tr> + <td style="text-align: left">16,383</td> + <td style="text-align: left">0xff, 0x7f</td> + </tr> + <tr> + <td style="text-align: left">16,384</td> + <td style="text-align: left">0x80, 0x80, 0x01</td> + </tr> + <tr> + <td style="text-align: left">16,385</td> + <td style="text-align: left">0x81, 0x80, 0x01</td> + </tr> + </tbody> +</table> + +<p>For signed integer types, the number is converted into an unsigned +number using a zigzag encoding. Zigzag encoding moves the sign bit to +the least significant bit using the expression (val &lt;&lt; 1) ^ (val &gt;&gt; +63) and derives its name from the fact that positive and negative +numbers alternate once encoded. 
The unsigned number is then serialized +as above.</p> + +<table> + <thead> + <tr> + <th style="text-align: left">Signed Original</th> + <th style="text-align: left">Unsigned</th> + </tr> + </thead> + <tbody> + <tr> + <td style="text-align: left">0</td> + <td style="text-align: left">0</td> + </tr> + <tr> + <td style="text-align: left">-1</td> + <td style="text-align: left">1</td> + </tr> + <tr> + <td style="text-align: left">1</td> + <td style="text-align: left">2</td> + </tr> + <tr> + <td style="text-align: left">-2</td> + <td style="text-align: left">3</td> + </tr> + <tr> + <td style="text-align: left">2</td> + <td style="text-align: left">4</td> + </tr> + </tbody> +</table> + +<h1 id="byte-run-length-encoding">Byte Run Length Encoding</h1> + +<p>For byte streams, ORC uses a very lightweight encoding of identical +values.</p> + +<ul> + <li>Run - a sequence of at least 3 identical values</li> + <li>Literals - a sequence of non-identical values</li> +</ul> + +<p>The first byte of each group of values is a header that determines +whether it is a run (value between 0 and 127) or literal list (value +between -128 and -1). For runs, the control byte is the length of the +run minus the length of the minimal run (3) and the control byte for +literal lists is the negative length of the list. For example, a +run of a hundred 0s is encoded as [0x61, 0x00] and the sequence 0x44, 0x45 +would be encoded as [0xfe, 0x44, 0x45]. The next group can choose +either of the encodings.</p> + +<h1 id="boolean-run-length-encoding">Boolean Run Length Encoding</h1> + +<p>For encoding boolean types, the bits are put in the bytes from most +significant to least significant. The bytes are encoded using byte run +length encoding as described in the previous section. 
For example, +the byte sequence [0xff, 0x80] would be one true followed by +seven false values.</p> + +<h1 id="integer-run-length-encoding-version-1">Integer Run Length Encoding, version 1</h1> + +<p>In Hive 0.11, ORC files used Run Length Encoding version 1 (RLEv1), +which provides a lightweight compression of signed or unsigned integer +sequences. RLEv1 has two sub-encodings:</p> + +<ul> + <li>Run - a sequence of values that differ by a small fixed delta</li> + <li>Literals - a sequence of varint encoded values</li> +</ul> + +<p>Runs start with an initial byte of 0x00 to 0x7f, which encodes the +length of the run - 3. A second byte provides the fixed delta in the +range of -128 to 127. Finally, the first value of the run is encoded +as a base 128 varint.</p> + +<p>For example, if the sequence is 100 instances of 7 the encoding would +start with 100 - 3, followed by a delta of 0, and a varint of 7 for +an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers +running from 100 to 1, the first byte is 100 - 3, the delta is -1, +and the varint is 100 for an encoding of [0x61, 0xff, 0x64].</p> + +<p>Literals start with an initial byte of 0x80 to 0xff, which corresponds +to the negative of the number of literals in the sequence. Following the +header byte, the list of N varints is encoded. Thus, if there are +no runs, the overhead is 1 byte for each 128 integers. The first 5 +prime numbers [2, 3, 5, 7, 11] would be encoded as [0xfb, 0x02, 0x03, +0x05, 0x07, 0x0b].</p> + +<h1 id="integer-run-length-encoding-version-2">Integer Run Length Encoding, version 2</h1> + +<p>In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2), +which has improved compression and fixed bit width encodings for +faster expansion. 
RLEv2 uses four sub-encodings based on the data:</p> + +<ul> + <li>Short Repeat - used for short sequences with repeated values</li> + <li>Direct - used for random sequences with a fixed bit width</li> + <li>Patched Base - used for random sequences with a variable bit width</li> + <li>Delta - used for monotonically increasing or decreasing sequences</li> +</ul> + +<h2 id="short-repeat">Short Repeat</h2> + +<p>The short repeat encoding is used for short repeating integer +sequences with the goal of minimizing the overhead of the header. All +of the bits listed in the header are from the first byte to the last +and from most significant bit to least significant bit. If the type is +signed, the value is zigzag encoded.</p> + +<ul> + <li>1 byte header + <ul> + <li>2 bits for encoding type (0)</li> + <li>3 bits for width (W) of repeating value (1 to 8 bytes)</li> + <li>3 bits for repeat count (3 to 10 values)</li> + </ul> + </li> + <li>W bytes in big endian format, which is zigzag encoded if the type +is signed</li> +</ul> + +<p>The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be +serialized with short repeat encoding (0), a width of 2 bytes (1), and +repeat count of 5 (2) as [0x0a, 0x27, 0x10].</p> + +<h2 id="direct">Direct</h2> + +<p>The direct encoding is used for integer sequences whose values have a +relatively constant bit width. It encodes the values directly using a +fixed width big endian encoding. 
The width of the values is encoded +using the table below.</p> + +<p>The 5 bit width encoding table for RLEv2:</p> + +<table> + <thead> + <tr> + <th style="text-align: left">Width in Bits</th> + <th style="text-align: left">Encoded Value</th> + <th style="text-align: left">Notes</th> + </tr> + </thead> + <tbody> + <tr> + <td style="text-align: left">0</td> + <td style="text-align: left">0</td> + <td style="text-align: left">for delta encoding</td> + </tr> + <tr> + <td style="text-align: left">1</td> + <td style="text-align: left">0</td> + <td style="text-align: left">for non-delta encoding</td> + </tr> + <tr> + <td style="text-align: left">2</td> + <td style="text-align: left">1</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">4</td> + <td style="text-align: left">3</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">8</td> + <td style="text-align: left">7</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">16</td> + <td style="text-align: left">15</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">24</td> + <td style="text-align: left">23</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">32</td> + <td style="text-align: left">27</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">40</td> + <td style="text-align: left">28</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">48</td> + <td style="text-align: left">29</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">56</td> + <td style="text-align: left">30</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">64</td> + <td style="text-align: left">31</td> + <td style="text-align: left"> </td> + </tr> + <tr> + <td style="text-align: left">3</td> + <td style="text-align: left">2</td> + 
<td style="text-align: left">deprecated</td> + </tr> + <tr> + <td style="text-align: left">5 <= x <= 7</td> + <td style="text-align: left">x - 1</td> + <td style="text-align: left">deprecated</td> + </tr> + <tr> + <td style="text-align: left">9 <= x <= 15</td> + <td style="text-align: left">x - 1</td> + <td style="text-align: left">deprecated</td> + </tr> + <tr> + <td style="text-align: left">17 <= x <= 21</td> + <td style="text-align: left">x - 1</td> + <td style="text-align: left">deprecated</td> + </tr> + <tr> + <td style="text-align: left">26</td> + <td style="text-align: left">24</td> + <td style="text-align: left">deprecated</td> + </tr> + <tr> + <td style="text-align: left">28</td> + <td style="text-align: left">25</td> + <td style="text-align: left">deprecated</td> + </tr> + <tr> + <td style="text-align: left">30</td> + <td style="text-align: left">26</td> + <td style="text-align: left">deprecated</td> + </tr> + </tbody> +</table> + +<ul> + <li>2 bytes header + <ul> + <li>2 bits for encoding type (1)</li> + <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit +width encoding table</li> + <li>9 bits for length (L) (1 to 512 values)</li> + </ul> + </li> + <li>W * L bits (padded to the next byte) encoded in big endian format, which is +zigzag encoded if the type is signed</li> +</ul> + +<p>The unsigned sequence of [23713, 43806, 57005, 48879] would be +serialized with direct encoding (1), a width of 16 bits (15), and +length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad, +0xbe, 0xef].</p> + +<h2 id="patched-base">Patched Base</h2> + +<p>The patched base encoding is used for integer sequences whose bit +widths vary a lot. The minimum signed value of the sequence is found +and subtracted from the other values. The bit width of those adjusted +values is analyzed and the 90th percentile of the bit width is chosen +as W. The 10% of values wider than W bits use patches from a patch list +to set the additional bits. 
Patches are encoded as a list of gaps in +the index values and the additional value bits.</p> + +<ul> + <li>4 bytes header + <ul> + <li>2 bits for encoding type (2)</li> + <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit + width encoding table</li> + <li>9 bits for length (L) (1 to 512 values)</li> + <li>3 bits for base value width (BW) (1 to 8 bytes)</li> + <li>5 bits for patch width (PW) (1 to 64 bits) using the 5 bit width +encoding table</li> + <li>3 bits for patch gap width (PGW) (1 to 8 bits)</li> + <li>5 bits for patch list length (PLL) (0 to 31 patches)</li> + </ul> + </li> + <li>Base value (BW bytes) - The base value is stored as a big endian value +with negative values marked by the most significant bit set. If that +bit is set, the entire value is negated.</li> + <li>Data values (W * L bits padded to the byte) - A sequence of W bit positive +values that are added to the base value.</li> + <li>Patch list (PLL * (PGW + PW) bytes) - A list of patches for values +that didn’t fit within W bits. Each entry in the list consists of a +gap, which is the number of elements skipped from the previous +patch, and a patch value. Patches are applied by logically or’ing +the data values with the relevant patch shifted W bits left. If a +patch is 0, it was introduced to skip over more than 255 items. The +combined length of each patch (PGW + PW) must be less than or equal to +64.</li> +</ul> + +<p>The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, +2070, 2080, 2090] has a minimum of 2000, which makes the adjusted +sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90]. It has an +encoding of patched base (2), a bit width of 8 (7), a length of 10 +(9), a base value width of 2 bytes (1), a patch width of 12 bits (11), +patch gap width of 2 bits (1), and a patch list length of 1 (1). 
The +base value is 2000 and the combined result is [0x8e, 0x09, 0x2b, 0x21, +0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, +0x5a, 0xfc, 0xe8].</p> + +<h2 id="delta">Delta</h2> + +<p>The Delta encoding is used for monotonically increasing or decreasing +sequences. The first two numbers in the sequence cannot be identical, +because the encoding uses the sign of the first delta to determine +if the series is increasing or decreasing.</p> + +<ul> + <li>2 bytes header + <ul> + <li>2 bits for encoding type (3)</li> + <li>5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit +width encoding table</li> + <li>9 bits for run length (L) (1 to 512 values)</li> + </ul> + </li> + <li>Base value - encoded as (signed or unsigned) varint</li> + <li>Delta base - encoded as signed varint</li> + <li>Delta values (W * (L - 2) bits, padded to the next byte) - encode each delta after the first +one. If the delta base is positive, the sequence is increasing and if it is +negative the sequence is decreasing.</li> +</ul> + +<p>The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be +serialized with delta encoding (3), a width of 4 bits (3), length of +10 (9), a base of 2 (2), and first delta of 1 (2). 
The resulting +sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].</p> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <div class="section-nav"> + <div class="left align-right"> + + + + <a href="/docs/compression.html" class="prev">Back</a> + + </div> + <div class="right align-left"> + + + + <a href="/docs/stripes.html" class="next">Next</a> + + </div> + </div> + <div class="clear"></div> + + + </article> + </div> + + <div class="unit one-fifth hide-on-mobiles"> + <aside> + + <h4>Overview</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/index.html">Background</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/types.html">Types</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/indexes.html">Indexes</a></li> + + + + + + + + + + + + <li class=""><a href="/docs/acid.html">ACID support</a></li> + + + +</ul> + + + <h4>Hive Usage</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li> + + + +</ul> + + + <h4>Format Specification</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-intro.html">Introduction</a></li> + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/file-tail.html">File Tail</a></li> + + + + + + + + + + + + + + <li class=""><a href="/docs/compression.html">Compression</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class="current"><a href="/docs/run-length.html">Run Length Encoding</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/stripes.html">Stripes</a></li> + + + + + + + + + + + + + + + + <li class=""><a href="/docs/encodings.html">Column Encodings</a></li> + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-index.html">Indexes</a></li> + + + +</ul> + + + </aside> +</div> + + + <div class="clear"></div> + + </div> + </section> + + + <footer role="contentinfo"> + <p>The contents of this website are © 2015 + <a href="https://www.apache.org/">Apache Software Foundation</a> + under the terms of the <a + href="https://www.apache.org/licenses/LICENSE-2.0.html"> + Apache License v2</a>. Apache ORC and its logo are trademarks + of the Apache Software Foundation.</p> +</footer> + + <script> + var anchorForId = function (id) { + var anchor = document.createElement("a"); + anchor.className = "header-link"; + anchor.href = "#" + id; + anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>"; + anchor.title = "Permalink"; + return anchor; + }; + + var linkifyAnchors = function (level, containingElement) { + var headers = containingElement.getElementsByTagName("h" + level); + for (var h = 0; h < headers.length; h++) { + var header = headers[h]; + + if (typeof header.id !== "undefined" && header.id !== "") { + header.appendChild(anchorForId(header.id)); + } + } + }; + + document.onreadystatechange = function () { + if (this.readyState === "complete") { + var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0]; + if (!contentBlock) { + return; + } + for (var level = 1; level <= 6; level++) { + linkifyAnchors(level, contentBlock); + } + } + }; +</script> + + +</body> +</html> http://git-wip-us.apache.org/repos/asf/orc/blob/6a400548/docs/spec-index.html ---------------------------------------------------------------------- diff --git a/docs/spec-index.html b/docs/spec-index.html new file mode 100644 index 0000000..0151fc2 --- /dev/null +++ b/docs/spec-index.html @@ -0,0 +1,1108 @@ +<!DOCTYPE HTML> +<html lang="en-US"> +<head> + <meta charset="UTF-8"> + <title>Indexes</title> + <meta name="viewport" 
content="width=device-width,initial-scale=1"> + <meta name="generator" content="Jekyll v2.4.0"> + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + <link rel="stylesheet" href="/css/screen.css"> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + <!--[if lt IE 9]> + <script src="/js/html5shiv.min.js"></script> + <script src="/js/respond.min.js"></script> + <![endif]--> +</head> + + +<body class="wrap"> + <header role="banner"> + <nav class="mobile-nav show-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + <div class="grid"> + <div class="unit one-third center-on-mobiles"> + <h1> + <a href="/"> + <span class="sr-only">Apache ORC</span> + <img src="/img/logo.png" width="249" height="115" alt="ORC Logo"> + </a> + </h1> + </div> + <nav class="main-nav unit two-thirds hide-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + </div> +</header> + + + <section class="docs"> + <div class="grid"> + + <div class="docs-nav-mobile unit whole show-on-mobiles"> + <select onchange="if (this.value) window.location.href=this.value"> + <option value="">Navigate the docs…</option> + + <optgroup label="Overview"> + + + + + + + + + + + + + + + + + + + + <option value="/docs/index.html">Background</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + <option value="/docs/types.html">Types</option> + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/indexes.html">Indexes</option> + + + + + + + + + + + + + + + + + + <option value="/docs/acid.html">ACID support</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Hive Usage"> + + + + + + + + + + + + + + + + + + <option value="/docs/hive-ddl.html">Hive DDL</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/hive-config.html">Hive Configuration</option> + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Format Specification"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-intro.html">Introduction</option> + + + + + + + + + + + + + + + + + + <option value="/docs/file-tail.html">File Tail</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/compression.html">Compression</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/run-length.html">Run Length Encoding</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/stripes.html">Stripes</option> + + + + + + + + + + + + + + <option value="/docs/encodings.html">Column Encodings</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-index.html">Indexes</option> + + + + + + + + + + + </optgroup> + + </select> +</div> + + + <div class="unit four-fifths"> + <article> + <h1>Indexes</h1> + <h1 id="row-group-index">Row Group Index</h1> + +<p>The row group indexes consist of a ROW_INDEX stream for each primitive +column that has an entry for each row group. Row groups are controlled +by the writer and default to 10,000 rows. 
Each RowIndexEntry gives the +position of each stream for the column and the statistics for that row +group.</p> + +<p>The index streams are placed at the front of the stripe, because in +the default case of streaming they do not need to be read. They are +only loaded when either predicate pushdown is being used or the +reader seeks to a particular row.</p> + +<p><code>message RowIndexEntry { + repeated uint64 positions = 1 [packed=true]; + optional ColumnStatistics statistics = 2; +} +</code></p> + +<p><code>message RowIndex { + repeated RowIndexEntry entry = 1; +} +</code></p> + +<p>To record positions, each stream needs a sequence of numbers. For +uncompressed streams, the position is the byte offset of the RLE run’s +start location followed by the number of values that need to be +consumed from the run. In compressed streams, the first number is the +start of the compression chunk in the stream, followed by the number +of decompressed bytes that need to be consumed, and finally the number +of values consumed in the RLE.</p> + +<p>For columns with multiple streams, the sequences of positions in each +stream are concatenated. That was an unfortunate decision on my part +that we should fix at some point, because it makes code that uses the +indexes error-prone.</p> + +<p>Because dictionaries are accessed randomly, there is not a position to +record for the dictionary and the entire dictionary must be read even +if only part of a stripe is being read.</p> + +<h1 id="bloom-filter-index">Bloom Filter Index</h1> + +<p>Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards. +Predicate pushdown can make use of bloom filters to better prune +the row groups that do not satisfy the filter condition. +The bloom filter indexes consist of a BLOOM_FILTER stream for each +column specified through the “orc.bloom.filter.columns” table property. +A BLOOM_FILTER stream records a bloom filter entry for each row +group (10,000 rows by default) in a column. 
Only the row groups that +satisfy min/max row index evaluation will be evaluated against the +bloom filter index.</p> + +<p>Each BloomFilter entry stores the number of hash functions (‘k’) used and +the bitset backing the bloom filter. The bitset is serialized as repeated +longs from which the number of bits (‘m’) for the bloom filter can be derived: +m = bitset.length * 64.</p> + +<p><code>message BloomFilter { + optional uint32 numHashFunctions = 1; + repeated fixed64 bitset = 2; +} +</code></p> + +<p><code>message BloomFilterIndex { + repeated BloomFilter bloomFilter = 1; +} +</code></p> + +<p>The bloom filter internally uses two different hash functions to map a key +to a position in the bit set. For tinyint, smallint, int, bigint, float +and double types, Thomas Wang’s 64-bit integer hash function is used. +Floats are converted to their IEEE-754 32 bit representation +(using Java’s Float.floatToIntBits(float)). Similarly, doubles are +converted to their IEEE-754 64 bit representation (using Java’s +Double.doubleToLongBits(double)). All these primitive types +are cast to the long base type before being passed on to the hash function. +For strings and binary types, the Murmur3 64 bit hash algorithm is used. +The 64 bit variant of Murmur3 considers only the most significant +8 bytes of the Murmur3 128-bit algorithm. The 64 bit hashcode generated +from the above algorithms is used as a base to derive ‘k’ different +hash functions. We use the idea mentioned in the paper “Less Hashing, +Same Performance: Building a Better Bloom Filter” by Kirsch et al. 
to +quickly compute the k hashcodes.</p> + +<p>The algorithm for computing k hashcodes and setting the bit position +in a bloom filter is as follows:</p> + +<ol> + <li>Get the 64 bit base hash code from Murmur3 or Thomas Wang’s hash algorithm.</li> + <li>Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).</li> + <li>The k’th hashcode is obtained by (where k > 0): + <ul> + <li>combinedHash = hash1 + (k * hash2)</li> + </ul> + </li> + <li>If combinedHash is negative, flip all the bits: + <ul> + <li>combinedHash = ~combinedHash</li> + </ul> + </li> + <li>The bit set position is obtained by performing modulo with m: + <ul> + <li>position = combinedHash % m</li> + </ul> + </li> + <li>Set the position in the bit set. The bits above the 6 least significant bits of the position identify the long index +within the bitset, and the 6 least significant bits give the bit position within that long, counted from the least significant bit. + <ul> + <li>bitset[position >>> 6] |= (1L << position);</li> + </ul> + </li> +</ol> + +<p>Bloom filter streams are interlaced with row group indexes. This placement +makes it convenient to read the bloom filter stream and row index stream +together in a single read operation.</p> + +<p><img src="/img/BloomFilter.png" alt="bloom filter" /></p> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <div class="section-nav"> + <div class="left align-right"> + + + + <a href="/docs/encodings.html" class="prev">Back</a> + + </div> + <div class="right align-left"> + + <span class="next disabled">Next</span> + + </div> + </div> + <div class="clear"></div> + + + </article> + </div> + + <div class="unit one-fifth hide-on-mobiles"> + <aside> + + <h4>Overview</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/index.html">Background</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/types.html">Types</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/indexes.html">Indexes</a></li> + + + + + 
+ + + + + + + <li class=""><a href="/docs/acid.html">ACID support</a></li> + + + +</ul> + + + <h4>Hive Usage</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li> + + + +</ul> + + + <h4>Format Specification</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-intro.html">Introduction</a></li> + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/file-tail.html">File Tail</a></li> + + + + + + + + + + + + + + <li class=""><a href="/docs/compression.html">Compression</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/stripes.html">Stripes</a></li> + + + + + + + + + + + + + + + + <li class=""><a href="/docs/encodings.html">Column Encodings</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class="current"><a href="/docs/spec-index.html">Indexes</a></li> + + + +</ul> + + + </aside> +</div> + + + <div class="clear"></div> + + </div> + </section> + + + <footer role="contentinfo"> + <p>The contents of this website are © 2015 + <a href="https://www.apache.org/">Apache Software Foundation</a> + under the terms of the <a + href="https://www.apache.org/licenses/LICENSE-2.0.html"> + Apache License v2</a>. 
Apache ORC and its logo are trademarks + of the Apache Software Foundation.</p> +</footer> + + <script> + var anchorForId = function (id) { + var anchor = document.createElement("a"); + anchor.className = "header-link"; + anchor.href = "#" + id; + anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>"; + anchor.title = "Permalink"; + return anchor; + }; + + var linkifyAnchors = function (level, containingElement) { + var headers = containingElement.getElementsByTagName("h" + level); + for (var h = 0; h < headers.length; h++) { + var header = headers[h]; + + if (typeof header.id !== "undefined" && header.id !== "") { + header.appendChild(anchorForId(header.id)); + } + } + }; + + document.onreadystatechange = function () { + if (this.readyState === "complete") { + var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0]; + if (!contentBlock) { + return; + } + for (var level = 1; level <= 6; level++) { + linkifyAnchors(level, contentBlock); + } + } + }; +</script> + + +</body> +</html> http://git-wip-us.apache.org/repos/asf/orc/blob/6a400548/docs/spec-intro.html ---------------------------------------------------------------------- diff --git a/docs/spec-intro.html b/docs/spec-intro.html new file mode 100644 index 0000000..5784360 --- /dev/null +++ b/docs/spec-intro.html @@ -0,0 +1,993 @@ +<!DOCTYPE HTML> +<html lang="en-US"> +<head> + <meta charset="UTF-8"> + <title>Introduction</title> + <meta name="viewport" content="width=device-width,initial-scale=1"> + <meta name="generator" content="Jekyll v2.4.0"> + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + <link rel="stylesheet" href="/css/screen.css"> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + <!--[if lt IE 9]> + <script src="/js/html5shiv.min.js"></script> + <script src="/js/respond.min.js"></script> + <![endif]--> +</head> + + +<body 
class="wrap"> + <header role="banner"> + <nav class="mobile-nav show-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + <div class="grid"> + <div class="unit one-third center-on-mobiles"> + <h1> + <a href="/"> + <span class="sr-only">Apache ORC</span> + <img src="/img/logo.png" width="249" height="115" alt="ORC Logo"> + </a> + </h1> + </div> + <nav class="main-nav unit two-thirds hide-on-mobiles"> + <ul> + <li class=""> + <a href="/">Home</a> + </li> + <li class="current"> + <a href="/docs/">Documentation</a> + </li> + <li class=""> + <a href="/talks/">Talks</a> + </li> + <li class=""> + <a href="/news/">News</a> + </li> + <li class=""> + <a href="/help/">Help</a> + </li> + <li class=""> + <a href="/develop/">Develop</a> + </li> +</ul> + + </nav> + </div> +</header> + + + <section class="docs"> + <div class="grid"> + + <div class="docs-nav-mobile unit whole show-on-mobiles"> + <select onchange="if (this.value) window.location.href=this.value"> + <option value="">Navigate the docs…</option> + + <optgroup label="Overview"> + + + + + + + + + + + + + + + + + + + + <option value="/docs/index.html">Background</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/types.html">Types</option> + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/indexes.html">Indexes</option> + + + + + + + + + + + + + + + + + + <option value="/docs/acid.html">ACID support</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Hive Usage"> + + + + + + + + + + + + + + + + + + <option value="/docs/hive-ddl.html">Hive DDL</option> + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + <option value="/docs/hive-config.html">Hive Configuration</option> + + + + + + + + + + + + + + + + + + + + + </optgroup> + + <optgroup label="Format Specification"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-intro.html">Introduction</option> + + + + + + + + + + + + + + + + + + <option value="/docs/file-tail.html">File Tail</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/compression.html">Compression</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/run-length.html">Run Length Encoding</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/stripes.html">Stripes</option> + + + + + + + + + + + + + + <option value="/docs/encodings.html">Column Encodings</option> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <option value="/docs/spec-index.html">Indexes</option> + + + + + + + + + + + </optgroup> + + </select> +</div> + + + <div class="unit four-fifths"> + <article> + <h1>Introduction</h1> + <p>Hive’s RCFile was the standard format for storing tabular data in +Hadoop for several years. However, RCFile has limitations because it +treats each column as a binary blob without semantics. In Hive 0.11 we +added a new file format named Optimized Row Columnar (ORC) file that +uses and retains the type information from the table definition. ORC +uses type specific readers and writers that provide lightweight +compression techniques such as dictionary encoding, bit packing, delta +encoding, and run length encoding, resulting in dramatically smaller +files. Additionally, ORC can apply generic compression using zlib or +Snappy on top of the lightweight compression for even smaller +files. However, storage savings are only part of the gain. 
ORC +supports projection, which selects subsets of the columns for reading, +so that queries reading only one column read only the required +bytes. Furthermore, ORC files include lightweight indexes that +record the minimum and maximum values for each column in each set of +10,000 rows and for the entire file. Using pushdown filters from Hive, the +file reader can skip entire sets of rows that aren't relevant to +the query.</p> + +<p><img src="/img/OrcFileLayout.png" alt="ORC file structure" /></p> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <div class="section-nav"> + <div class="left align-right"> + + + + <a href="/docs/hive-config.html" class="prev">Back</a> + + </div> + <div class="right align-left"> + + + + <a href="/docs/file-tail.html" class="next">Next</a> + + </div> + </div> + <div class="clear"></div> + + + </article> + </div> + + <div class="unit one-fifth hide-on-mobiles"> + <aside> + + <h4>Overview</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/index.html">Background</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/types.html">Types</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/indexes.html">Indexes</a></li> + + + + + + + + + + + + <li class=""><a href="/docs/acid.html">ACID support</a></li> + + + +</ul> + + + <h4>Hive Usage</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li> + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li> + + + +</ul> + + + <h4>Format Specification</h4> + + +<ul> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class="current"><a href="/docs/spec-intro.html">Introduction</a></li> + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/file-tail.html">File Tail</a></li> + + + + + + + + + + + + + + <li class=""><a 
href="/docs/compression.html">Compression</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/stripes.html">Stripes</a></li> + + + + + + + + + + + + + + + + <li class=""><a href="/docs/encodings.html">Column Encodings</a></li> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <li class=""><a href="/docs/spec-index.html">Indexes</a></li> + + + +</ul> + + + </aside> +</div> + + + <div class="clear"></div> + + </div> + </section> + + + <footer role="contentinfo"> + <p>The contents of this website are © 2015 + <a href="https://www.apache.org/">Apache Software Foundation</a> + under the terms of the <a + href="https://www.apache.org/licenses/LICENSE-2.0.html"> + Apache License v2</a>. Apache ORC and its logo are trademarks + of the Apache Software Foundation.</p> +</footer> + + <script> + var anchorForId = function (id) { + var anchor = document.createElement("a"); + anchor.className = "header-link"; + anchor.href = "#" + id; + anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>"; + anchor.title = "Permalink"; + return anchor; + }; + + var linkifyAnchors = function (level, containingElement) { + var headers = containingElement.getElementsByTagName("h" + level); + for (var h = 0; h < headers.length; h++) { + var header = headers[h]; + + if (typeof header.id !== "undefined" && header.id !== "") { + header.appendChild(anchorForId(header.id)); + } + } + }; + + document.onreadystatechange = function () { + if (this.readyState === "complete") { + var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0]; + if (!contentBlock) { + return; + } + for (var level = 1; level <= 6; level++) { + linkifyAnchors(level, contentBlock); + } + } + }; +</script> + + +</body> +</html>
