[GOBBLIN-351] Add ParquetHdfsDataWriter docs

[GOBBLIN-351] Add more info about builder and dictionary encoding

Closes #2220 from tilakpatidar/parquet_hdfs_docs

Project: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/commit/3598d10e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/tree/3598d10e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/diff/3598d10e

Branch: refs/heads/0.12.0
Commit: 3598d10eb0ea0d01244a93ff1506a563afeca9ed
Parents: 3094fe5
Author: tilakpatidar <[email protected]>
Authored: Mon Feb 5 12:03:31 2018 -0800
Committer: Abhishek Tiwari <[email protected]>
Committed: Mon Feb 5 12:03:31 2018 -0800

----------------------------------------------------------------------
 gobblin-docs/sinks/ParquetHdfsDataWriter.md | 25 ++++++++++++++++++++++++
 mkdocs.yml                                  |  1 +
 2 files changed, 26 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-gobblin/blob/3598d10e/gobblin-docs/sinks/ParquetHdfsDataWriter.md
----------------------------------------------------------------------
diff --git a/gobblin-docs/sinks/ParquetHdfsDataWriter.md b/gobblin-docs/sinks/ParquetHdfsDataWriter.md
new file mode 100644
index 0000000..f3ad0da
--- /dev/null
+++ b/gobblin-docs/sinks/ParquetHdfsDataWriter.md
@@ -0,0 +1,25 @@
+# Description
+
+An extension to [`FsDataWriter`](https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/writer/FsDataWriter.java) that writes in Parquet format in the form of [`Group.java`](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/example/data/Group.java). This implementation allows users to specify the CodecFactory to use through the configuration property [`writer.codec.type`](https://gobblin.readthedocs.io/en/latest/user-guide/Configuration-Properties-Glossary/#writercodectype). By default, the deflate codec is used.
+
+# Usage
+```
+writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
+writer.destination.type=HDFS
+writer.output.format=PARQUET
+```
+For more info, see
+[`ParquetHdfsDataWriter`](https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-parquet/src/main/java/org/apache/gobblin/writer/ParquetHdfsDataWriter.java)
+and
+[`ParquetDataWriterBuilder`](https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-parquet/src/main/java/org/apache/gobblin/writer/ParquetDataWriterBuilder.java)
+
+
+# Configuration
+
+| Key | Description | Default Value | Required |
+|------------------------|-------------|---------------|----------|
+| writer.parquet.page.size | The page size threshold. | 1048576 | No |
+| writer.parquet.dictionary.page.size | The block size threshold for the dictionary pages. | 134217728 | No |
+| writer.parquet.dictionary | To turn dictionary encoding on. Parquet has a dictionary encoding for data with a small number of unique values ( < 10^5 ) that aids in significant compression and boosts processing speed. | true | No |
+| writer.parquet.validate | To turn on validation using the schema. This validation is done by [`ParquetWriter`](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java) not by Gobblin. | false | No |
+| writer.parquet.version | Version of parquet writer to use. Available versions are v1 and v2. | v1 | No |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-gobblin/blob/3598d10e/mkdocs.yml
----------------------------------------------------------------------
diff --git a/mkdocs.yml b/mkdocs.yml
index f3486b4..7152bd0 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -64,6 +64,7 @@ pages:
     - Wikipedia: sources/WikipediaSource.md
   - Record Sinks:
     - Avro HDFS: sinks/AvroHdfsDataWriter.md
+    - Parquet HDFS: sinks/ParquetHdfsDataWriter.md
     - HDFS Byte array: sinks/SimpleBytesWriter.md
     - Console: sinks/ConsoleWriter.md
     - Couchbase: sinks/Couchbase-Writer.md
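
For readers who want to see how the settings documented in this commit fit together, the following is a minimal sketch of a job configuration fragment. It is not part of the commit. It uses only the keys that appear in the new `ParquetHdfsDataWriter.md`; the overridden values (`v2`, validation turned on) are illustrative assumptions chosen to show the syntax, not recommendations from the author.

```properties
# Writer selection, copied from the Usage section of the new docs
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET

# Optional overrides (illustrative values only)
# Use the v2 Parquet writer instead of the documented default, v1
writer.parquet.version=v2
# Ask ParquetWriter to validate records against the schema (documented default: false)
writer.parquet.validate=true
# Keep dictionary encoding on and the page size at their documented defaults
writer.parquet.dictionary=true
writer.parquet.page.size=1048576
```

Any key left out of the fragment should fall back to the default listed in the Configuration table of the diff.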
