orc git commit: Add docs for C++ tools and core API.

gangwu Thu, 12 Apr 2018 13:41:29 -0700

Repository: orc
Updated Branches:
  refs/heads/master 98fdba9a3 -> 473e69ca7



Add docs for C++ tools and core API.

Fixes #244

Signed-off-by: Gang Wu <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/orc/repo
Commit: http://git-wip-us.apache.org/repos/asf/orc/commit/473e69ca
Tree: http://git-wip-us.apache.org/repos/asf/orc/tree/473e69ca
Diff: http://git-wip-us.apache.org/repos/asf/orc/diff/473e69ca

Branch: refs/heads/master
Commit: 473e69ca74f36b705dbeaeb74bd0a51715e44cdb
Parents: 98fdba9
Author: Gang Wu <[email protected]>
Authored: Wed Apr 11 13:40:22 2018 -0700
Committer: Gang Wu <[email protected]>
Committed: Thu Apr 12 13:39:32 2018 -0700

----------------------------------------------------------------------
 site/_data/docs.yml      |   4 +-
 site/_docs/core-cpp.md   | 266 ++++++++++++++++++++++++++++++++++
 site/_docs/cpp-tools.md  | 268 ++++++++++++++++++++++++++++++++++
 site/_docs/java-tools.md | 247 ++++++++++++++++++++++++++++++++
 site/_docs/tools.md      | 326 ------------------------------------------
 5 files changed, 784 insertions(+), 327 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/orc/blob/473e69ca/site/_data/docs.yml
----------------------------------------------------------------------
diff --git a/site/_data/docs.yml b/site/_data/docs.yml
index 70087af..9730ac3 100644
--- a/site/_data/docs.yml
+++ b/site/_data/docs.yml
@@ -24,10 +24,12 @@
 - title: Using ORC Core
   docs:
   - core-java
+  - core-cpp
 
 - title: Tools
   docs:
-  - tools
+  - cpp-tools
+  - java-tools
 
 - title: Format Specification
   docs:

http://git-wip-us.apache.org/repos/asf/orc/blob/473e69ca/site/_docs/core-cpp.md
----------------------------------------------------------------------
diff --git a/site/_docs/core-cpp.md b/site/_docs/core-cpp.md
new file mode 100644
index 0000000..4b9f683
--- /dev/null
+++ b/site/_docs/core-cpp.md
@@ -0,0 +1,266 @@
+---
+layout: docs
+title: Using Core C++
+permalink: /docs/core-cpp.html
+---
+
+The C++ Core ORC API reads and writes ORC files into its own
+orc::ColumnVectorBatch vectorized classes.
+
+## Vectorized Row Batch
+
+Data is passed to ORC as instances of orc::ColumnVectorBatch
+that contain the data for a batch of rows. The focus is on speed and
+accessing the data fields directly. `numElements` is the number
+of rows. ColumnVectorBatch is the parent type of the different
+kinds of columns and has some fields that are shared across
+all of the column types. In particular, the `hasNulls` flag indicates
+whether there is any null in this column for this batch. For columns
+where `hasNulls == true`, the `notNull` entry for a row is false if
+that value is null.
+
+~~~ cpp
+namespace orc {
+  struct ColumnVectorBatch {
+    uint64_t numElements;
+    DataBuffer<char> notNull;
+    bool hasNulls;
+    ...
+  };
+}
+~~~
+
+The subtypes of ColumnVectorBatch are:
+
+| ORC Type | ColumnVectorBatch |
+| -------- | ------------- |
+| array | ListVectorBatch |
+| binary | StringVectorBatch |
+| bigint | LongVectorBatch |
+| boolean | LongVectorBatch |
+| char | StringVectorBatch |
+| date | LongVectorBatch |
+| decimal | Decimal64VectorBatch, Decimal128VectorBatch |
+| double | DoubleVectorBatch |
+| float | DoubleVectorBatch |
+| int | LongVectorBatch |
+| map | MapVectorBatch |
+| smallint | LongVectorBatch |
+| string | StringVectorBatch |
+| struct | StructVectorBatch |
+| timestamp | TimestampVectorBatch |
+| tinyint | LongVectorBatch |
+| uniontype | UnionVectorBatch |
+| varchar | StringVectorBatch |
+
+LongVectorBatch handles all of the integer types (boolean, bigint,
+date, int, smallint, and tinyint). The data is represented as a
+buffer of int64_t where each value is sign-extended as necessary.
+
+~~~ cpp
+  struct LongVectorBatch: public ColumnVectorBatch {
+    DataBuffer<int64_t> data;
+    ...
+  };
+~~~
+
+TimestampVectorBatch handles timestamp values. The data is
+represented as two buffers of int64_t for seconds and nanoseconds
+respectively. Note that the data is always assumed to be in the GMT
+timezone; therefore it is the user's responsibility to convert wall
+clock time from the local timezone to GMT.
+
+~~~ cpp
+  struct TimestampVectorBatch: public ColumnVectorBatch {
+    DataBuffer<int64_t> data;
+    DataBuffer<int64_t> nanoseconds;
+    ...
+  };
+~~~
+
+DoubleVectorBatch handles all of the floating point types
+(double and float). The data is represented as a buffer of doubles.
+
+~~~ cpp
+  struct DoubleVectorBatch: public ColumnVectorBatch {
+    DataBuffer<double> data;
+    ...
+  };
+~~~
+
+Decimal64VectorBatch handles decimal columns with precision no
+greater than 18. Decimal128VectorBatch handles the others. The data
+is represented as a buffer of int64_t and orc::Int128 respectively.
+
+~~~ cpp
+  struct Decimal64VectorBatch: public ColumnVectorBatch {
+    DataBuffer<int64_t> values;
+    ...
+  };
+
+  struct Decimal128VectorBatch: public ColumnVectorBatch {
+    DataBuffer<Int128> values;
+    ...
+  };
+~~~
+
+StringVectorBatch handles all of the binary types (binary,
+char, string, and varchar). The data is represented as a char* buffer,
+and a length buffer.
+
+~~~ cpp
+  struct StringVectorBatch: public ColumnVectorBatch {
+    DataBuffer<char*> data;
+    DataBuffer<int64_t> length;
+    ...
+  };
+~~~
+
+StructVectorBatch handles the struct columns and represents
+the data as a vector of `ColumnVectorBatch` pointers, one per field.
+
+~~~ cpp
+  struct StructVectorBatch: public ColumnVectorBatch {
+    std::vector<ColumnVectorBatch*> fields;
+    ...
+  };
+~~~
+
+UnionVectorBatch handles the union columns. It uses `tags`
+to indicate which subtype has the value and `offsets` to indicate
+the offset in the child batch of that subtype. An individual
+`ColumnVectorBatch` is used for each subtype.
+
+~~~ cpp
+  struct UnionVectorBatch: public ColumnVectorBatch {
+    DataBuffer<unsigned char> tags;
+    DataBuffer<uint64_t> offsets;
+    std::vector<ColumnVectorBatch*> children;
+    ...
+  };
+~~~
+
+ListVectorBatch handles the array columns and represents
+the data as a buffer of integers for the offsets and a
+`ColumnVectorBatch` for the children values.
+
+~~~ cpp
+  struct ListVectorBatch: public ColumnVectorBatch {
+    DataBuffer<int64_t> offsets;
+    ORC_UNIQUE_PTR<ColumnVectorBatch> elements;
+    ...
+  };
+~~~
+
+MapVectorBatch handles the map columns and represents the data
+as a buffer of integers for the offsets and two `ColumnVectorBatch`es
+for the keys and values.
+
+~~~ cpp
+  struct MapVectorBatch: public ColumnVectorBatch {
+    DataBuffer<int64_t> offsets;
+    ORC_UNIQUE_PTR<ColumnVectorBatch> keys;
+    ORC_UNIQUE_PTR<ColumnVectorBatch> elements;
+    ...
+  };
+~~~
+
+## Writing ORC Files
+
+To write an ORC file, you need to include `OrcFile.hh` and define
+the schema; then use `orc::OutputStream` and `orc::WriterOptions`
+to create an `orc::Writer` with the desired filename. This example
+sets the required schema parameter, but there are many other
+options to control the ORC writer.
+
+~~~ cpp
+ORC_UNIQUE_PTR<OutputStream> outStream =
+  writeLocalFile("my-file.orc");
+ORC_UNIQUE_PTR<Type> schema(
+  Type::buildTypeFromString("struct<x:int,y:int>"));
+WriterOptions options;
+ORC_UNIQUE_PTR<Writer> writer =
+  createWriter(*schema, outStream.get(), options);
+~~~
+
+Now you need to create a row batch, set the data, and write it to the file
+as the batch fills up. When the file is done, close the `Writer`.
+
+~~~ cpp
+uint64_t batchSize = 1024, rowCount = 10000;
+ORC_UNIQUE_PTR<ColumnVectorBatch> batch =
+  writer->createRowBatch(batchSize);
+StructVectorBatch *root =
+  dynamic_cast<StructVectorBatch *>(batch.get());
+LongVectorBatch *x =
+  dynamic_cast<LongVectorBatch *>(root->fields[0]);
+LongVectorBatch *y =
+  dynamic_cast<LongVectorBatch *>(root->fields[1]);
+
+uint64_t rows = 0;
+for (uint64_t i = 0; i < rowCount; ++i) {
+  x->data[rows] = i;
+  y->data[rows] = i * 3;
+  rows++;
+
+  if (rows == batchSize) {
+    root->numElements = rows;
+    x->numElements = rows;
+    y->numElements = rows;
+
+    writer->add(*batch);
+    rows = 0;
+  }
+}
+
+if (rows != 0) {
+  root->numElements = rows;
+  x->numElements = rows;
+  y->numElements = rows;
+
+  writer->add(*batch);
+  rows = 0;
+}
+
+writer->close();
+~~~
+
+## Reading ORC Files
+
+To read ORC files, include `OrcFile.hh` and create an `orc::Reader`
+that contains the metadata about the file. There are a few options to
+the `orc::Reader`, but far fewer than for the writer, and none of them
+are required. The reader has methods for getting the number of rows,
+schema, compression, etc. from the file.
+
+~~~ cpp
+ORC_UNIQUE_PTR<InputStream> inStream =
+  readLocalFile("my-file.orc");
+ReaderOptions options;
+ORC_UNIQUE_PTR<Reader> reader =
+  createReader(inStream, options);
+~~~
+
+To get the data, create an `orc::RowReader` object. By default,
+the RowReader reads all rows and all columns, but there are
+options to control the data that is read.
+
+~~~ cpp
+RowReaderOptions rowReaderOptions;
+ORC_UNIQUE_PTR<RowReader> rowReader =
+  reader->createRowReader(rowReaderOptions);
+ORC_UNIQUE_PTR<ColumnVectorBatch> batch =
+  rowReader->createRowBatch(1024);
+~~~
+
+With an `orc::RowReader` the user can ask for the next batch until there
+are no more left. The reader will stop the batch at certain boundaries,
+so the returned batch may not be full, but it will always contain some rows.
+
+~~~ cpp
+while (rowReader->next(*batch)) {
+  for (uint64_t r = 0; r < batch->numElements; ++r) {
+    // ... process row r from batch
+  }
+}
+~~~
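
Note on the null-handling convention described in core-cpp.md above: the
following is a minimal sketch of how a consumer might honor `hasNulls` and
`notNull` when walking an integer column. It is not part of this commit and
does not use the ORC library. `MockLongBatch`, `sumNonNull`, and
`makeExampleBatch` are hypothetical names introduced only for illustration,
and `std::vector` stands in for `orc::DataBuffer` so the sketch compiles on
its own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for orc::LongVectorBatch (an assumption for this
// sketch): std::vector replaces orc::DataBuffer so no ORC headers are needed.
struct MockLongBatch {
  uint64_t numElements = 0;    // number of rows in the batch
  bool hasNulls = false;       // true if any row in the batch is null
  std::vector<char> notNull;   // notNull[r] == 0 means row r is null
  std::vector<int64_t> data;   // one slot per row
};

// Sums the non-null values. The notNull buffer is consulted only when
// hasNulls is set, mirroring the convention described in the docs.
int64_t sumNonNull(const MockLongBatch& batch) {
  assert(batch.data.size() >= batch.numElements);
  assert(!batch.hasNulls || batch.notNull.size() >= batch.numElements);
  int64_t total = 0;
  for (uint64_t r = 0; r < batch.numElements; ++r) {
    if (!batch.hasNulls || batch.notNull[r]) {
      total += batch.data[r];
    }
  }
  return total;
}

// Builds a three-row batch whose middle row is null.
MockLongBatch makeExampleBatch() {
  MockLongBatch batch;
  batch.numElements = 3;
  batch.hasNulls = true;
  batch.notNull = {1, 0, 1};
  batch.data = {10, 0, 32};
  return batch;
}
```

With the real API the same loop would presumably read `batch->notNull[r]` and
`batch->data[r]` on the `LongVectorBatch` obtained from the row reader;
checking `hasNulls` first avoids touching the mask for columns without nulls.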

http://git-wip-us.apache.org/repos/asf/orc/blob/473e69ca/site/_docs/cpp-tools.md
----------------------------------------------------------------------
diff --git a/site/_docs/cpp-tools.md b/site/_docs/cpp-tools.md
new file mode 100644
index 0000000..d4d6e75
--- /dev/null
+++ b/site/_docs/cpp-tools.md
@@ -0,0 +1,268 @@
+---
+layout: docs
+title: C++ Tools
+permalink: /docs/cpp-tools.html
+---
+
+## orc-contents
+
+Displays the contents of the ORC file as a JSON document. With the
+`columns` argument only the selected columns are printed.
+
+~~~ shell
+% orc-contents [--columns=1,2,...] <filename>
+~~~
+
+If you run it on the example file TestOrcFile.test1.orc, you'll see (without
+the line breaks within each record):
+
+~~~ shell
+% orc-contents examples/TestOrcFile.test1.orc
+{"boolean1": false, "byte1": 1, "short1": 1024, "int1": 65536, \\
+ "long1": 9223372036854775807, "float1": 1, "double1": -15, \\
+ "bytes1": [0, 1, 2, 3, 4], "string1": "hi", "middle": \\
+    {"list": [{"int1": 1, "string1": "bye"}, \\
+              {"int1": 2, "string1": "sigh"}]}, \\
+ "list": [{"int1": 3, "string1": "good"}, \\
+          {"int1": 4, "string1": "bad"}], \\
+ "map": []}
+{"boolean1": true, "byte1": 100, "short1": 2048, "int1": 65536, \\
+ "long1": 9223372036854775807, "float1": 2, "double1": -5, \\
+ "bytes1": [], "string1": "bye", \\
+ "middle": {"list": [{"int1": 1, "string1": "bye"}, \\
+                     {"int1": 2, "string1": "sigh"}]}, \\
+ "list": [{"int1": 100000000, "string1": "cat"}, \\
+          {"int1": -100000, "string1": "in"}, \\
+          {"int1": 1234, "string1": "hat"}], \\
+ "map": [{"key": "chani", "value": {"int1": 5, "string1": "chani"}}, \\
+         {"key": "mauddib", \\
+          "value": {"int1": 1, "string1": "mauddib"}}]}
+~~~
+
+## orc-metadata
+
+Displays the metadata of the ORC file as a JSON document. With the
+`verbose` option additional information about the layout of the file
+is also printed.
+
+For diagnosing problems, it is useful to use the `--raw` option that
+prints the protocol buffers from the ORC file directly rather than
+interpreting them.
+
+~~~ shell
+% orc-metadata [-v] [--raw] <filename>
+~~~
+
+If you run it on the example file TestOrcFile.test1.orc, you'll see:
+
+~~~ shell
+% orc-metadata examples/TestOrcFile.test1.orc
+{ "name": "../examples/TestOrcFile.test1.orc",
+  "type": "struct<boolean1:boolean,byte1:tinyint,short1:smallint,
+int1:int,long1:bigint,float1:float,double1:double,bytes1:binary,
+string1:string,middle:struct<list:array<struct<int1:int,string1:
+string>>>,list:array<struct<int1:int,string1:string>>,map:map<
+string,struct<int1:int,string1:string>>>",
+  "rows": 2,
+  "stripe count": 1,
+  "format": "0.12", "writer version": "HIVE-8732",
+  "compression": "zlib", "compression block": 10000,
+  "file length": 1711,
+  "content": 1015, "stripe stats": 250, "footer": 421, "postscript": 24,
+  "row index stride": 10000,
+  "user metadata": {
+  },
+  "stripes": [
+    { "stripe": 0, "rows": 2,
+      "offset": 3, "length": 1012,
+      "index": 570, "data": 243, "footer": 199
+    }
+  ]
+}
+~~~
+
+## csv-import
+
+Imports a CSV file into an ORC file using the specified schema.
+Compound types are not yet supported. The `delimiter` option sets
+the delimiter of the input CSV file and defaults to `,`. The `stripe`
+option sets the stripe size and defaults to 128MB. The `block`
+option sets the compression block size and defaults to 64KB. The
+`batch` option sets the number of rows per batch and defaults to 1024.
+
+~~~ shell
+% csv-import [--delimiter=<character>] [--stripe=<size>]
+             [--block=<size>] [--batch=<size>]
+             <schema> <inputCSVFile> <outputORCFile>
+~~~
+
+If you run it on the example file TestCSVFileImport.test10rows.csv,
+you'll see:
+
+~~~ shell
+% csv-import "struct<a:bigint,b:string,c:double>"
+             examples/TestCSVFileImport.test10rows.csv /tmp/test.orc
+[2018-04-11 11:12:16] Start importing Orc file...
+[2018-04-11 11:12:16] Finish importing Orc file.
+[2018-04-11 11:12:16] Total writer elasped time: 0.001352s.
+[2018-04-11 11:12:16] Total writer CPU time: 0.001339s.
+~~~
+
+## orc-scan
+
+Scans the ORC file and displays its row count. The `batch` option
+sets the batch size, which is 1024 rows by default. It is useful for
+checking whether the ORC file is damaged.
+
+~~~ shell
+% orc-scan [--batch=<size>] <filename>
+~~~
+
+If you run it on the example file TestOrcFile.test1.orc, you'll see:
+
+~~~ shell
+% orc-scan examples/TestOrcFile.test1.orc
+Rows: 2
+Batches: 1
+~~~
+
+## orc-statistics
+
+Displays the file-level and stripe-level column statistics of the ORC file.
+The `withIndex` option also includes the column statistics of each row group.
+
+~~~ shell
+% orc-statistics [--withIndex] <filename>
+~~~
+
+If you run it on the example file TestOrcFile.columnProjection.orc,
+you'll see:
+
+~~~ shell
+% orc-statistics examples/TestOrcFile.columnProjection.orc
+File examples/TestOrcFile.columnProjection.orc has 3 columns
+*** Column 0 ***
+Column has 21000 values and has null value: no
+
+*** Column 1 ***
+Data type: Integer
+Values: 21000
+Has null: no
+Minimum: -2147439072
+Maximum: 2147257982
+Sum: 268482658568
+
+*** Column 2 ***
+Data type: String
+Values: 21000
+Has null: no
+Minimum: 100119c272d7db89
+Maximum: fffe9f6f23b287f3
+Total length: 334559
+
+File examples/TestOrcFile.columnProjection.orc has 5 stripes
+*** Stripe 0 ***
+
+--- Column 0 ---
+Column has 5000 values and has null value: no
+
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2145365268
+Maximum: 2147025027
+Sum: -29841423854
+
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 1005350489418be2
+Maximum: fffbb8718c92b09f
+Total length: 79644
+
+*** Stripe 1 ***
+
+--- Column 0 ---
+Column has 5000 values and has null value: no
+
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2147115959
+Maximum: 2147257982
+Sum: 108604887785
+
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 100119c272d7db89
+Maximum: fff0ae41d41e6afc
+Total length: 79640
+
+*** Stripe 2 ***
+
+--- Column 0 ---
+Column has 5000 values and has null value: no
+
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2145932387
+Maximum: 2145877119
+Sum: 70064190848
+
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 10130af874ae036c
+Maximum: fffe9f6f23b287f3
+Total length: 79645
+
+*** Stripe 3 ***
+
+--- Column 0 ---
+Column has 5000 values and has null value: no
+
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2147439072
+Maximum: 2147074354
+Sum: 104681356482
+
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 102547d48ed06518
+Maximum: fffa47c57dc7b69a
+Total length: 79689
+
+*** Stripe 4 ***
+
+--- Column 0 ---
+Column has 1000 values and has null value: no
+
+--- Column 1 ---
+Data type: Integer
+Values: 1000
+Has null: no
+Minimum: -2141222223
+Maximum: 2145816096
+Sum: 14973647307
+
+--- Column 2 ---
+Data type: String
+Values: 1000
+Has null: no
+Minimum: 1059d81c9025a217
+Maximum: ffc17f0e35e1a6c0
+Total length: 15941
+~~~
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/orc/blob/473e69ca/site/_docs/java-tools.md
----------------------------------------------------------------------
diff --git a/site/_docs/java-tools.md b/site/_docs/java-tools.md
new file mode 100644
index 0000000..3855955
--- /dev/null
+++ b/site/_docs/java-tools.md
@@ -0,0 +1,247 @@
+---
+layout: docs
+title: Java Tools
+permalink: /docs/java-tools.html
+---
+
+In addition to the C++ tools, there is an ORC tools jar that packages
+several useful utilities and the necessary Java dependencies
+(including Hadoop) into a single package. The Java ORC tool jar
+supports both the local file system and HDFS.
+
+The subcommands for the tools are:
+
+  * meta - print the metadata of an ORC file
+  * data - print the data of an ORC file
+  * scan (since ORC 1.3) - scan the data for benchmarking
+  * convert (since ORC 1.4) - convert JSON files to ORC
+  * json-schema (since ORC 1.4) - determine the schema of JSON documents
+
+The command line looks like:
+
+~~~ shell
+% java -jar orc-tools-X.Y.Z-uber.jar <sub-command> <args>
+~~~
+
+## Java Meta
+
+The meta command prints the metadata about the given ORC file and is
+equivalent to the Hive ORC File Dump command.
+
+-j
+  : format the output in JSON
+
+-p
+  : pretty print the output
+
+-t
+  : print the timezone of the writer
+
+--rowindex
+  : print the row indexes for the comma separated list of column ids
+
+--recover
+  : skip over corrupted values in the ORC file
+
+--skip-dump
+  : skip dumping the metadata
+
+--backup-path
+  : when used with --recover specifies the path where the recovered file is written
+
+An example of the output is given below:
+
+~~~ shell
+% java -jar orc-tools-X.Y.Z-uber.jar meta examples/TestOrcFile.test1.orc
+Processing data file examples/TestOrcFile.test1.orc [length: 1711]
+Structure for examples/TestOrcFile.test1.orc
+File Version: 0.12 with HIVE_8732
+Rows: 2
+Compression: ZLIB
+Compression size: 10000
+Type: struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,
+long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,
+middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<
+struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:
+string>>>
+
+Stripe Statistics:
+  Stripe 1:
+    Column 0: count: 2 hasNull: false
+    Column 1: count: 2 hasNull: false true: 1
+    Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
+    Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
+    Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
+    Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
+    Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
+    Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
+    Column 8: count: 2 hasNull: false sum: 5
+    Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
+    Column 10: count: 2 hasNull: false
+    Column 11: count: 2 hasNull: false
+    Column 12: count: 4 hasNull: false
+    Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
+    Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
+    Column 15: count: 2 hasNull: false
+    Column 16: count: 5 hasNull: false
+    Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
+    Column 18: count: 5 hasNull: false min: bad max: in sum: 15
+    Column 19: count: 2 hasNull: false
+    Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
+    Column 21: count: 2 hasNull: false
+    Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
+    Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
+
+File Statistics:
+  Column 0: count: 2 hasNull: false
+  Column 1: count: 2 hasNull: false true: 1
+  Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
+  Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
+  Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
+  Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
+  Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
+  Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
+  Column 8: count: 2 hasNull: false sum: 5
+  Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
+  Column 10: count: 2 hasNull: false
+  Column 11: count: 2 hasNull: false
+  Column 12: count: 4 hasNull: false
+  Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
+  Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
+  Column 15: count: 2 hasNull: false
+  Column 16: count: 5 hasNull: false
+  Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
+  Column 18: count: 5 hasNull: false min: bad max: in sum: 15
+  Column 19: count: 2 hasNull: false
+  Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
+  Column 21: count: 2 hasNull: false
+  Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
+  Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
+
+Stripes:
+  Stripe: offset: 3 data: 243 rows: 2 tail: 199 index: 570
+    Stream: column 0 section ROW_INDEX start: 3 length 11
+    Stream: column 1 section ROW_INDEX start: 14 length 22
+    Stream: column 2 section ROW_INDEX start: 36 length 26
+    Stream: column 3 section ROW_INDEX start: 62 length 27
+    Stream: column 4 section ROW_INDEX start: 89 length 30
+    Stream: column 5 section ROW_INDEX start: 119 length 28
+    Stream: column 6 section ROW_INDEX start: 147 length 34
+    Stream: column 7 section ROW_INDEX start: 181 length 34
+    Stream: column 8 section ROW_INDEX start: 215 length 21
+    Stream: column 9 section ROW_INDEX start: 236 length 30
+    Stream: column 10 section ROW_INDEX start: 266 length 11
+    Stream: column 11 section ROW_INDEX start: 277 length 16
+    Stream: column 12 section ROW_INDEX start: 293 length 11
+    Stream: column 13 section ROW_INDEX start: 304 length 24
+    Stream: column 14 section ROW_INDEX start: 328 length 31
+    Stream: column 15 section ROW_INDEX start: 359 length 16
+    Stream: column 16 section ROW_INDEX start: 375 length 11
+    Stream: column 17 section ROW_INDEX start: 386 length 32
+    Stream: column 18 section ROW_INDEX start: 418 length 30
+    Stream: column 19 section ROW_INDEX start: 448 length 16
+    Stream: column 20 section ROW_INDEX start: 464 length 37
+    Stream: column 21 section ROW_INDEX start: 501 length 11
+    Stream: column 22 section ROW_INDEX start: 512 length 24
+    Stream: column 23 section ROW_INDEX start: 536 length 37
+    Stream: column 1 section DATA start: 573 length 5
+    Stream: column 2 section DATA start: 578 length 6
+    Stream: column 3 section DATA start: 584 length 9
+    Stream: column 4 section DATA start: 593 length 11
+    Stream: column 5 section DATA start: 604 length 12
+    Stream: column 6 section DATA start: 616 length 11
+    Stream: column 7 section DATA start: 627 length 15
+    Stream: column 8 section DATA start: 642 length 8
+    Stream: column 8 section LENGTH start: 650 length 6
+    Stream: column 9 section DATA start: 656 length 8
+    Stream: column 9 section LENGTH start: 664 length 6
+    Stream: column 11 section LENGTH start: 670 length 6
+    Stream: column 13 section DATA start: 676 length 7
+    Stream: column 14 section DATA start: 683 length 6
+    Stream: column 14 section LENGTH start: 689 length 6
+    Stream: column 14 section DICTIONARY_DATA start: 695 length 10
+    Stream: column 15 section LENGTH start: 705 length 6
+    Stream: column 17 section DATA start: 711 length 25
+    Stream: column 18 section DATA start: 736 length 18
+    Stream: column 18 section LENGTH start: 754 length 8
+    Stream: column 19 section LENGTH start: 762 length 6
+    Stream: column 20 section DATA start: 768 length 15
+    Stream: column 20 section LENGTH start: 783 length 6
+    Stream: column 22 section DATA start: 789 length 6
+    Stream: column 23 section DATA start: 795 length 15
+    Stream: column 23 section LENGTH start: 810 length 6
+    Encoding column 0: DIRECT
+    Encoding column 1: DIRECT
+    Encoding column 2: DIRECT
+    Encoding column 3: DIRECT_V2
+    Encoding column 4: DIRECT_V2
+    Encoding column 5: DIRECT_V2
+    Encoding column 6: DIRECT
+    Encoding column 7: DIRECT
+    Encoding column 8: DIRECT_V2
+    Encoding column 9: DIRECT_V2
+    Encoding column 10: DIRECT
+    Encoding column 11: DIRECT_V2
+    Encoding column 12: DIRECT
+    Encoding column 13: DIRECT_V2
+    Encoding column 14: DICTIONARY_V2[2]
+    Encoding column 15: DIRECT_V2
+    Encoding column 16: DIRECT
+    Encoding column 17: DIRECT_V2
+    Encoding column 18: DIRECT_V2
+    Encoding column 19: DIRECT_V2
+    Encoding column 20: DIRECT_V2
+    Encoding column 21: DIRECT
+    Encoding column 22: DIRECT_V2
+    Encoding column 23: DIRECT_V2
+
+File length: 1711 bytes
+Padding length: 0 bytes
+Padding ratio: 0%
+______________________________________________________________________
+~~~
+
+## Java Data
+
+The data command prints the data in an ORC file as a JSON document. Each
+record is printed as a JSON object on a line. Each record is annotated with
+the field names and a JSON representation that depends on the field's type.
+
+## Java Scan
+
+The scan command reads the contents of the file without printing anything. It
+is primarily intended for benchmarking the Java reader without including the
+cost of printing the data out.
+
+## Java Convert
+
+The convert command reads several JSON files and converts them into a
+single ORC file.
+
+-o <filename>
+  : Sets the output ORC filename, which defaults to output.orc
+
+-s <schema>
+  : Sets the schema for the ORC file. By default, the schema is automatically discovered.
+
+-h
+  : Print help
+  
+The automatic JSON schema discovery is equivalent to the json-schema tool
+below.
+
+## Java JSON Schema
+
+The JSON Schema discovery tool processes a set of JSON documents and
+produces a schema that encompasses all of the records in all of the
+documents. It works by computing the enclosing type and promoting it
+to include all of the observed values.
+
+-f
+  : Print the schema as a list of flat types for each subfield
+
+-t
+  : Print the schema as a Hive table declaration
+
+-h
+  : Print help
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/orc/blob/473e69ca/site/_docs/tools.md
----------------------------------------------------------------------
diff --git a/site/_docs/tools.md b/site/_docs/tools.md
deleted file mode 100644
index 04ff4fd..0000000
--- a/site/_docs/tools.md
+++ /dev/null
@@ -1,326 +0,0 @@
----
-layout: docs
-title: Tools
-permalink: /docs/tools.html
----
-
-## orc-contents
-
-Displays the contents of the ORC file as a JSON document. With the
-`columns` argument only the selected columns are printed.
-
-~~~ shell
-% orc-contents  [--columns=1,2,...] <filename>
-~~~
-
-If you run it on the example file TestOrcFile.test1.orc, you'll see (without
-the line breaks within each record):
-
-~~~ shell
-% orc-contents examples/TestOrcFile.test1.orc
-{"boolean1": false, "byte1": 1, "short1": 1024, "int1": 65536, \\
- "long1": 9223372036854775807, "float1": 1, "double1": -15, \\
- "bytes1": [0, 1, 2, 3, 4], "string1": "hi", "middle": \\
-    {"list": [{"int1": 1, "string1": "bye"}, \\
-              {"int1": 2, "string1": "sigh"}]}, \\
- "list": [{"int1": 3, "string1": "good"}, \\
-          {"int1": 4, "string1": "bad"}], \\
- "map": []}
-{"boolean1": true, "byte1": 100, "short1": 2048, "int1": 65536,
- "long1": 9223372036854775807, "float1": 2, "double1": -5, \\
- "bytes1": [], "string1": "bye", \\
- "middle": {"list": [{"int1": 1, "string1": "bye"}, \\
-                     {"int1": 2, "string1": "sigh"}]}, \\
- "list": [{"int1": 100000000, "string1": "cat"}, \\
-          {"int1": -100000, "string1": "in"}, \\
-          {"int1": 1234, "string1": "hat"}], \\
- "map": [{"key": "chani", "value": {"int1": 5, "string1": "chani"}}, \\
-         {"key": "mauddib", \\
-          "value": {"int1": 1, "string1": "mauddib"}}]}
-~~~
-
-## orc-metadata
-
-Displays the metadata of the ORC file as a JSON document. With the
-`verbose` option additional information about the layout of the file
-is also printed.
-
-For diagnosing problems, it is useful to use the '--raw' option that
-prints the protocol buffers from the ORC file directly rather than
-interpreting them.
-
-~~~ shell
-% orc-metadata [-v] [--raw] <filename>
-~~~
-
-If you run it on the example file TestOrcFile.test1.orc, you'll see:
-
-~~~ shell
-% orc-metadata examples/TestOrcFile.test1.orc
-{ "name": "../examples/TestOrcFile.test1.orc",
-  "type": "struct<boolean1:boolean,byte1:tinyint,short1:smallint,
-int1:int,long1:bigint,float1:float,double1:double,bytes1:binary,
-string1:string,middle:struct<list:array<struct<int1:int,string1:
-string>>>,list:array<struct<int1:int,string1:string>>,map:map<
-string,struct<int1:int,string1:string>>>",
-  "rows": 2,
-  "stripe count": 1,
-  "format": "0.12", "writer version": "HIVE-8732",
-  "compression": "zlib", "compression block": 10000,
-  "file length": 1711,
-  "content": 1015, "stripe stats": 250, "footer": 421, "postscript": 24,
-  "row index stride": 10000,
-  "user metadata": {
-  },
-  "stripes": [
-    { "stripe": 0, "rows": 2,
-      "offset": 3, "length": 1012,
-      "index": 570, "data": 243, "footer": 199
-    }
-  ]
-}
-~~~
-
-## Java ORC Tools
-
-In addition to the C++ tools above, there is an ORC tools jar that
-packages several useful utilities and the necessary Java dependencies
-(including Hadoop) into a single package. The Java ORC tool jar
-supports both the local file system and HDFS.
-
-The subcommands for the tools are:
-
-  * meta - print the metadata of an ORC file
-  * data - print the data of an ORC file
-  * scan (since ORC 1.3) - scan the data for benchmarking
-  * convert (since ORC 1.4) - convert JSON files to ORC
-  * json-schema (since ORC 1.4) - determine the schema of JSON documents
-
-The command line looks like:
-
-~~~ shell
-% java -jar orc-tools-X.Y.Z-uber.jar <sub-command> <args>
-~~~
-
-### Java Meta
-
-The meta command prints the metadata about the given ORC file and is
-equivalent to the Hive ORC File Dump command.
-
--j
-  : format the output in JSON
-
--p
-  : pretty print the output
-
--t
-  : print the timezone of the writer
-
---rowindex
-  : print the row indexes for the comma separated list of column ids
-
---recover
-  : skip over corrupted values in the ORC file
-
---skip-dump
-  : skip dumping the metadata
-
---backup-path
-  : when used with --recover specifies the path where the recovered file is written
-
-An example of the output is given below:
-
-~~~ shell
-% java -jar orc-tools-X.Y.Z-uber.jar meta examples/TestOrcFile.test1.orc
-Processing data file examples/TestOrcFile.test1.orc [length: 1711]
-Structure for examples/TestOrcFile.test1.orc
-File Version: 0.12 with HIVE_8732
-Rows: 2
-Compression: ZLIB
-Compression size: 10000
-Type: struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,
-long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,
-middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<
-struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:
-string>>>
-
-Stripe Statistics:
-  Stripe 1:
-    Column 0: count: 2 hasNull: false
-    Column 1: count: 2 hasNull: false true: 1
-    Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
-    Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
-    Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
-    Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
-    Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
-    Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
-    Column 8: count: 2 hasNull: false sum: 5
-    Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
-    Column 10: count: 2 hasNull: false
-    Column 11: count: 2 hasNull: false
-    Column 12: count: 4 hasNull: false
-    Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
-    Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
-    Column 15: count: 2 hasNull: false
-    Column 16: count: 5 hasNull: false
-    Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
-    Column 18: count: 5 hasNull: false min: bad max: in sum: 15
-    Column 19: count: 2 hasNull: false
-    Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
-    Column 21: count: 2 hasNull: false
-    Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
-    Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
-
-File Statistics:
-  Column 0: count: 2 hasNull: false
-  Column 1: count: 2 hasNull: false true: 1
-  Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
-  Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
-  Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
-  Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
-  Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
-  Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
-  Column 8: count: 2 hasNull: false sum: 5
-  Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
-  Column 10: count: 2 hasNull: false
-  Column 11: count: 2 hasNull: false
-  Column 12: count: 4 hasNull: false
-  Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
-  Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
-  Column 15: count: 2 hasNull: false
-  Column 16: count: 5 hasNull: false
-  Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
-  Column 18: count: 5 hasNull: false min: bad max: in sum: 15
-  Column 19: count: 2 hasNull: false
-  Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
-  Column 21: count: 2 hasNull: false
-  Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
-  Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
-
-Stripes:
-  Stripe: offset: 3 data: 243 rows: 2 tail: 199 index: 570
-    Stream: column 0 section ROW_INDEX start: 3 length 11
-    Stream: column 1 section ROW_INDEX start: 14 length 22
-    Stream: column 2 section ROW_INDEX start: 36 length 26
-    Stream: column 3 section ROW_INDEX start: 62 length 27
-    Stream: column 4 section ROW_INDEX start: 89 length 30
-    Stream: column 5 section ROW_INDEX start: 119 length 28
-    Stream: column 6 section ROW_INDEX start: 147 length 34
-    Stream: column 7 section ROW_INDEX start: 181 length 34
-    Stream: column 8 section ROW_INDEX start: 215 length 21
-    Stream: column 9 section ROW_INDEX start: 236 length 30
-    Stream: column 10 section ROW_INDEX start: 266 length 11
-    Stream: column 11 section ROW_INDEX start: 277 length 16
-    Stream: column 12 section ROW_INDEX start: 293 length 11
-    Stream: column 13 section ROW_INDEX start: 304 length 24
-    Stream: column 14 section ROW_INDEX start: 328 length 31
-    Stream: column 15 section ROW_INDEX start: 359 length 16
-    Stream: column 16 section ROW_INDEX start: 375 length 11
-    Stream: column 17 section ROW_INDEX start: 386 length 32
-    Stream: column 18 section ROW_INDEX start: 418 length 30
-    Stream: column 19 section ROW_INDEX start: 448 length 16
-    Stream: column 20 section ROW_INDEX start: 464 length 37
-    Stream: column 21 section ROW_INDEX start: 501 length 11
-    Stream: column 22 section ROW_INDEX start: 512 length 24
-    Stream: column 23 section ROW_INDEX start: 536 length 37
-    Stream: column 1 section DATA start: 573 length 5
-    Stream: column 2 section DATA start: 578 length 6
-    Stream: column 3 section DATA start: 584 length 9
-    Stream: column 4 section DATA start: 593 length 11
-    Stream: column 5 section DATA start: 604 length 12
-    Stream: column 6 section DATA start: 616 length 11
-    Stream: column 7 section DATA start: 627 length 15
-    Stream: column 8 section DATA start: 642 length 8
-    Stream: column 8 section LENGTH start: 650 length 6
-    Stream: column 9 section DATA start: 656 length 8
-    Stream: column 9 section LENGTH start: 664 length 6
-    Stream: column 11 section LENGTH start: 670 length 6
-    Stream: column 13 section DATA start: 676 length 7
-    Stream: column 14 section DATA start: 683 length 6
-    Stream: column 14 section LENGTH start: 689 length 6
-    Stream: column 14 section DICTIONARY_DATA start: 695 length 10
-    Stream: column 15 section LENGTH start: 705 length 6
-    Stream: column 17 section DATA start: 711 length 25
-    Stream: column 18 section DATA start: 736 length 18
-    Stream: column 18 section LENGTH start: 754 length 8
-    Stream: column 19 section LENGTH start: 762 length 6
-    Stream: column 20 section DATA start: 768 length 15
-    Stream: column 20 section LENGTH start: 783 length 6
-    Stream: column 22 section DATA start: 789 length 6
-    Stream: column 23 section DATA start: 795 length 15
-    Stream: column 23 section LENGTH start: 810 length 6
-    Encoding column 0: DIRECT
-    Encoding column 1: DIRECT
-    Encoding column 2: DIRECT
-    Encoding column 3: DIRECT_V2
-    Encoding column 4: DIRECT_V2
-    Encoding column 5: DIRECT_V2
-    Encoding column 6: DIRECT
-    Encoding column 7: DIRECT
-    Encoding column 8: DIRECT_V2
-    Encoding column 9: DIRECT_V2
-    Encoding column 10: DIRECT
-    Encoding column 11: DIRECT_V2
-    Encoding column 12: DIRECT
-    Encoding column 13: DIRECT_V2
-    Encoding column 14: DICTIONARY_V2[2]
-    Encoding column 15: DIRECT_V2
-    Encoding column 16: DIRECT
-    Encoding column 17: DIRECT_V2
-    Encoding column 18: DIRECT_V2
-    Encoding column 19: DIRECT_V2
-    Encoding column 20: DIRECT_V2
-    Encoding column 21: DIRECT
-    Encoding column 22: DIRECT_V2
-    Encoding column 23: DIRECT_V2
-
-File length: 1711 bytes
-Padding length: 0 bytes
-Padding ratio: 0%
-______________________________________________________________________
-~~~
-
-### Java Data
-
-The data command prints the data in an ORC file as a JSON document. Each
-record is printed as a JSON object on a line. Each record is annotated with
-the fieldnames and a JSON representation that depends on the field's type.
-
-### Java Scan
-
-The scan command reads the contents of the file without printing anything. It
-is primarily intended for benchmarking the Java reader without including the
-cost of printing the data out.
-
-### Java Convert
-
-The convert command reads several JSON files and converts them into a
-single ORC file.
-
--o <filename>
-  : Sets the output ORC filename, which defaults to output.orc
-
--s <schema>
-  : Sets the schema for the ORC file. By default, the schema is automatically discovered.
-
--h
-  : Print help
-  
-The automatic JSON schema discovery is equivalent to the json-schema tool
-below.
-
-### Java JSON Schema
-
-The JSON Schema discovery tool processes a set of JSON documents and
-produces a schema that encompasses all of the records in all of the
-documents. It works by computing the enclosing type and promoting it
-to include all of the observed values.
-
--f
-  : Print the schema as a list of flat types for each subfield
-
--t
-  : Print the schema as a Hive table declaration
-
--h
-  : Print help
\ No newline at end of file
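
For readers of the removed "Java Data" section: it states that the data command prints each record as a JSON object on its own line. A minimal sketch of consuming such output follows. The sample text, the helper name `parse_records`, and the field values are made up for illustration and are not actual tool output; only the one-object-per-line layout comes from the documentation above.

```python
import json

# Hypothetical sample standing in for the output of the `data` subcommand.
# Assumption: one JSON object per line, as the removed docs describe.
sample_output = """\
{"int1": 65536, "string1": "hi"}
{"int1": 65536, "string1": "bye"}
"""


def parse_records(text):
    """Parse line-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_records(sample_output)
for record in records:
    # Each record maps field names to JSON values.
    print(record["string1"])
```

In practice the text would presumably come from the tool's standard output rather than from a literal string; parsing line by line avoids having to treat the whole output as a single JSON document.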
