This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new aee584d [HUDI-2670] - relative links broken in docs (#3907)
aee584d is described below
commit aee584dc6ee4d2691225518954aad03af68eb7ff
Author: Kyle Weller <[email protected]>
AuthorDate: Tue Nov 2 21:49:09 2021 -0700
[HUDI-2670] - relative links broken in docs (#3907)
* added new docs to current version to fix broken relative links
---
.../version-0.9.0/hoodie_deltastreamer.md | 211 +++++++++++++++++++++
.../version-0.9.0/query_engine_setup.md | 46 +++++
.../versioned_docs/version-0.9.0/table_types.md | 7 +
3 files changed, 264 insertions(+)
diff --git a/website/versioned_docs/version-0.9.0/hoodie_deltastreamer.md
b/website/versioned_docs/version-0.9.0/hoodie_deltastreamer.md
new file mode 100644
index 0000000..a97f1cb
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/hoodie_deltastreamer.md
@@ -0,0 +1,211 @@
+---
+title: Streaming Ingestion
+keywords: [hudi, deltastreamer, hoodiedeltastreamer]
+---
+
+## DeltaStreamer
+
+The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides a
way to ingest from different sources such as DFS or Kafka, with the following
capabilities.
+
+- Exactly once ingestion of new events from Kafka, [incremental
imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports)
from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
+- Support JSON, Avro, or custom record types for the incoming data
+- Manage checkpoints, rollback & recovery
+- Leverage Avro schemas from DFS or Confluent [schema
registry](https://github.com/confluentinc/schema-registry).
+- Support for plugging in transformations
+
+The command line options describe these capabilities in more detail:
+
+```java
+[hoodie]$ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+Options:
+ --checkpoint
+ Resume Delta Streamer from this checkpoint.
+ --commit-on-errors
+ Commit even when some records failed to be written
+ Default: false
+ --compact-scheduling-minshare
+ Minshare for compaction as defined in
+ https://spark.apache.org/docs/latest/job-scheduling
+ Default: 0
+ --compact-scheduling-weight
+ Scheduling weight for compaction as defined in
+ https://spark.apache.org/docs/latest/job-scheduling
+ Default: 1
+ --continuous
+ Delta Streamer runs in continuous mode running source-fetch -> Transform
+ -> Hudi Write in loop
+ Default: false
+ --delta-sync-scheduling-minshare
+ Minshare for delta sync as defined in
+ https://spark.apache.org/docs/latest/job-scheduling
+ Default: 0
+ --delta-sync-scheduling-weight
+ Scheduling weight for delta sync as defined in
+ https://spark.apache.org/docs/latest/job-scheduling
+ Default: 1
+ --disable-compaction
+ Compaction is enabled for MoR table by default. This flag disables it
+ Default: false
+ --enable-hive-sync
+ Enable syncing to hive
+ Default: false
+ --filter-dupes
+ Should duplicate records from source be dropped/filtered out before
+ insert/bulk-insert
+ Default: false
+ --help, -h
+
+ --hoodie-conf
+ Any configuration that can be set in the properties file (using the CLI
+ parameter "--propsFilePath") can also be passed command line using this
+ parameter
+ Default: []
+ --max-pending-compactions
+ Maximum number of outstanding inflight/requested compactions. Delta Sync
+ will not happen unlessoutstanding compactions is less than this number
+ Default: 5
+ --min-sync-interval-seconds
+ the min sync interval of each sync in continuous mode
+ Default: 0
+ --op
+ Takes one of these values : UPSERT (default), INSERT (use when input is
+ purely new data/inserts to gain speed)
+ Default: UPSERT
+ Possible Values: [UPSERT, INSERT, BULK_INSERT]
+ --payload-class
+ subclass of HoodieRecordPayload, that works off a GenericRecord.
+ Implement your own, if you want to do something other than overwriting
+ existing value
+ Default: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
+ --props
+ path to properties file on localfs or dfs, with configurations for
+ hoodie client, schema provider, key generator and data source. For
+ hoodie client props, sane defaults are used, but recommend use to
+ provide basic things like metrics endpoints, hive configs etc. For
+ sources, referto individual classes, for supported properties.
+ Default:
file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
+ --schemaprovider-class
+ subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
+ schemas to input & target table data, built in options:
+ org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source (See
+ org.apache.hudi.utilities.sources.Source) implementation can implement
+ their own SchemaProvider. For Sources that return Dataset<Row>, the
+ schema is obtained implicitly. However, this CLI option allows
+ overriding the schemaprovider returned by Source.
+ --source-class
+ Subclass of org.apache.hudi.utilities.sources to read data. Built-in
+ options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
+ AvroDFSSource, AvroKafkaSource, CsvDFSSource, HiveIncrPullSource,
+ JdbcSource, JsonKafkaSource, ORCDFSSource, ParquetDFSSource,
+ S3EventsHoodieIncrSource, S3EventsSource, SqlSource}
+ Default: org.apache.hudi.utilities.sources.JsonDFSSource
+ --source-limit
+ Maximum amount of data to read from source. Default: No limit For e.g:
+ DFS-Source => max bytes to read, Kafka-Source => max events to read
+ Default: 9223372036854775807
+ --source-ordering-field
+ Field within source record to decide how to break ties between records
+ with same key in input data. Default: 'ts' holding unix timestamp of
+ record
+ Default: ts
+ --spark-master
+ spark master to use.
+ Default: local[2]
+ * --table-type
+ Type of table. COPY_ON_WRITE (or) MERGE_ON_READ
+ * --target-base-path
+ base path for the target hoodie table. (Will be created if did not exist
+ first time around. If exists, expected to be a hoodie table)
+ * --target-table
+ name of the target table in Hive
+ --transformer-class
+ subclass of org.apache.hudi.utilities.transform.Transformer. Allows
+ transforming raw source Dataset to a target Dataset (conforming to
+ target schema) before writing. Default : Not set. E:g -
+ org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
+ allows a SQL query templated to be passed as a transformation function)
+```
+
+The tool takes a hierarchically composed property file and has pluggable
interfaces for extracting data, generating keys, and providing schemas. Sample
configs for ingesting from Kafka and DFS are
+provided under `hudi-utilities/src/test/resources/delta-streamer-config`.
+
+For example, once you have Confluent Kafka and Schema Registry up and running, produce
some test data using
[impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data)
(provided by the schema-registry repo):
+
+```java
+[confluent-5.0.0]$ bin/ksql-datagen schema=../impressions.avro format=avro
topic=impressions key=impressionid
+```
+
+and then ingest it as follows.
+
+```java
+[hoodie]$ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+ --props
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
+ --schemaprovider-class
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+ --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+ --source-ordering-field impresssiontime \
+ --target-base-path file:///tmp/hudi-deltastreamer-op \
+ --target-table uber.impressions \
+ --op BULK_INSERT
+```
+
+In some cases, you may want to migrate your existing table into Hudi
beforehand. Please refer to the [migration guide](/docs/migration_guide).
+
+## MultiTableDeltaStreamer
+
+`HoodieMultiTableDeltaStreamer`, a wrapper on top of `HoodieDeltaStreamer`,
ingests multiple tables into Hudi datasets in a single run.
Currently it only supports sequential processing of the tables to be ingested and
the COPY_ON_WRITE storage type. The command line options for
`HoodieMultiTableDeltaStreamer` are largely the same as those for
`HoodieDeltaStreamer`, with the one exception that you are required to provide
table-wise configs in separate files in a dedicated config folder. The [...]
+
+```java
+ * --config-folder
+ the path to the folder which contains all the table wise config files
+ --base-path-prefix
+ this is added to enable users to create all the hudi datasets for related
tables under one path in FS. The datasets are then created under the path -
<base_path_prefix>/<database>/<table_to_be_ingested>. However you can override
the paths for every table by setting the property
hoodie.deltastreamer.ingestion.targetBasePath
+```
+
+The following properties must be set to ingest data using
`HoodieMultiTableDeltaStreamer`.
+
+```java
+hoodie.deltastreamer.ingestion.tablesToBeIngested
+ comma separated names of tables to be ingested in the format
<database>.<table>, for example db1.table1,db1.table2
+hoodie.deltastreamer.ingestion.targetBasePath
+ if you wish to ingest a particular table in a separate path, you can mention
that path here
+hoodie.deltastreamer.ingestion.<database>.<table>.configFile
+ path to the config file in dedicated config folder which contains table
overridden properties for the particular table to be ingested.
+```
+
+Sample config files for table wise overridden properties can be found under
`hudi-utilities/src/test/resources/delta-streamer-config`. The command to run
`HoodieMultiTableDeltaStreamer` is also similar to how you run
`HoodieDeltaStreamer`.
+
+```java
+[hoodie]$ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+ --props
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
+ --config-folder file:///tmp/hudi-ingestion-config \
+ --schemaprovider-class
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+ --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+ --source-ordering-field impresssiontime \
+ --base-path-prefix file:///tmp/hudi-deltastreamer-op \
+ --target-table uber.impressions \
+ --op BULK_INSERT
+```
+
+For detailed information on how to configure and use
`HoodieMultiTableDeltaStreamer`, please refer to the [blog
section](/blog/2020/08/22/ingest-multiple-tables-using-hudi).
+
+## Concurrency Control
+
+The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides
ways to ingest from different sources such as DFS or Kafka, with the following
capabilities.
+
+Using optimistic_concurrency_control via DeltaStreamer requires adding the
concurrency control configs to the properties file that is passed to the
+job. For example, adding the configs to the kafka-source.properties file and
passing it to DeltaStreamer enables optimistic concurrency control.
+A DeltaStreamer job can then be triggered as follows:
+
+```java
+[hoodie]$ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+ --props
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
+ --schemaprovider-class
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+ --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+ --source-ordering-field impresssiontime \
+ --target-base-path file:///tmp/hudi-deltastreamer-op \
+ --target-table uber.impressions \
+ --op BULK_INSERT
+```
+
+Read more about concurrency control in the [concurrency control
concepts](/docs/concurrency_control) section.
diff --git a/website/versioned_docs/version-0.9.0/query_engine_setup.md
b/website/versioned_docs/version-0.9.0/query_engine_setup.md
new file mode 100644
index 0000000..99887b5
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/query_engine_setup.md
@@ -0,0 +1,46 @@
+---
+title: Query Engine Setup
+summary: "In this page, we describe how to set up various query engines for
Hudi."
+toc: true
+last_modified_at:
+---
+
+## Spark
+The Spark Datasource API is a popular way of authoring Spark ETL pipelines.
Hudi tables can be queried via the Spark datasource with a simple
`spark.read.parquet`.
+See the [Spark Quick Start](/docs/quick-start-guide) for more examples of
Spark datasource reading queries.
+
+If your Spark environment does not have the Hudi jars installed, add `--jars
<path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to the classpath of
drivers
+and executors. Alternatively, hudi-spark-bundle can also be fetched via the
`--packages` option (e.g., `--packages
org.apache.hudi:hudi-spark-bundle_2.11:0.5.3`).
+
+## PrestoDB
+PrestoDB is a popular query engine, providing interactive query performance.
PrestoDB currently supports snapshot querying on COPY_ON_WRITE tables.
+Both snapshot and read optimized queries are supported on MERGE_ON_READ Hudi
tables. Since the PrestoDB-Hudi integration has evolved over time, the installation
+instructions for PrestoDB vary by version. Please check the table
below for the query types supported and the installation instructions
+for different versions of PrestoDB.
+
+
+| **PrestoDB Version** | **Installation description** | **Query types
supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233 | Requires the `hudi-presto-bundle` jar to be placed
into `<presto_install>/plugin/hive-hadoop2/`, across the installation. |
Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| >= 0.233 | No action needed. Hudi (0.5.1-incubating) is a
compile time dependency. | Snapshot querying on COW tables. Read optimized
querying on MOR tables. |
+| >= 0.240 | No action needed. Hudi 0.5.3 version is a compile
time dependency. | Snapshot querying on both COW and MOR tables |
+
+## Trino
+:::note
+[Trino](https://trino.io/) (formerly PrestoSQL) was forked off of PrestoDB a
few years ago. Hudi supports 'Snapshot' queries for Copy-On-Write tables and
'Read Optimized' queries
+for Merge-On-Read tables. This works through the initial input-format-based
integration in PrestoDB (from before the fork). This approach has
+known performance limitations with very large tables, which have since been
fixed in PrestoDB. We are working on replicating the same fixes in Trino as
well.
+:::
+
+To query Hudi tables on Trino, please place the `hudi-presto-bundle` jar into
the Hive connector installation `<trino_install>/plugin/hive-hadoop2`.
+
+## Hive
+
+In order for Hive to recognize Hudi tables and query correctly,
+- HiveServer2 needs to be provided with the
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf#concept_nc3_mms_lr).
This ensures the input format
+ classes and their dependencies are available for query planning & execution.
+- For MERGE_ON_READ tables, additionally the bundle needs to be put on the
hadoop/hive installation across the cluster, so that queries can pick up the
custom RecordReader as well.
+
+In addition to the setup above, for beeline CLI access, the `hive.input.format`
variable needs to be set to the fully qualified class name of the
+input format `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez,
additionally the `hive.tez.input.format` needs to be set
+to `org.apache.hadoop.hive.ql.io.HiveInputFormat`. Then proceed to query the
table like any other Hive table.
diff --git a/website/versioned_docs/version-0.9.0/table_types.md
b/website/versioned_docs/version-0.9.0/table_types.md
new file mode 100644
index 0000000..1f0b767
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/table_types.md
@@ -0,0 +1,7 @@
+---
+title: Table Types
+summary: "In this page, we describe the different table types in Hudi."
+toc: true
+last_modified_at:
+---
+