[hudi] branch asf-site updated: [HUDI-2670] - relative links broken in docs (#3907)

vinoth Tue, 02 Nov 2021 21:49:29 -0700

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new aee584d  [HUDI-2670] - relative links broken in docs (#3907)
aee584d is described below

commit aee584dc6ee4d2691225518954aad03af68eb7ff
Author: Kyle Weller <[email protected]>
AuthorDate: Tue Nov 2 21:49:09 2021 -0700

    [HUDI-2670] - relative links broken in docs (#3907)
    
    
    * added new docs to current version to fix broken relative links
---
 .../version-0.9.0/hoodie_deltastreamer.md          | 211 +++++++++++++++++++++
 .../version-0.9.0/query_engine_setup.md            |  46 +++++
 .../versioned_docs/version-0.9.0/table_types.md    |   7 +
 3 files changed, 264 insertions(+)

diff --git a/website/versioned_docs/version-0.9.0/hoodie_deltastreamer.md 
b/website/versioned_docs/version-0.9.0/hoodie_deltastreamer.md
new file mode 100644
index 0000000..a97f1cb
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/hoodie_deltastreamer.md
@@ -0,0 +1,211 @@
+---
+title: Streaming Ingestion
+keywords: [hudi, deltastreamer, hoodiedeltastreamer]
+---
+
+## DeltaStreamer
+
+The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides a 
way to ingest from different sources such as DFS or Kafka, with the following 
capabilities.
+
+- Exactly once ingestion of new events from Kafka, [incremental 
imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports)
 from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
+- Support for JSON, Avro, or custom record types for the incoming data
+- Manage checkpoints, rollback & recovery
+- Leverage Avro schemas from DFS or Confluent [schema 
registry](https://github.com/confluentinc/schema-registry).
+- Support for plugging in transformations
+
+The command line options describe these capabilities in more detail:
+
+```java
+[hoodie]$ spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls 
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+Options:
+    --checkpoint
+      Resume Delta Streamer from this checkpoint.
+    --commit-on-errors
+      Commit even when some records failed to be written
+      Default: false
+    --compact-scheduling-minshare
+      Minshare for compaction as defined in
+      https://spark.apache.org/docs/latest/job-scheduling
+      Default: 0
+    --compact-scheduling-weight
+      Scheduling weight for compaction as defined in
+      https://spark.apache.org/docs/latest/job-scheduling
+      Default: 1
+    --continuous
+      Delta Streamer runs in continuous mode running source-fetch -> Transform
+      -> Hudi Write in loop
+      Default: false
+    --delta-sync-scheduling-minshare
+      Minshare for delta sync as defined in
+      https://spark.apache.org/docs/latest/job-scheduling
+      Default: 0
+    --delta-sync-scheduling-weight
+      Scheduling weight for delta sync as defined in
+      https://spark.apache.org/docs/latest/job-scheduling
+      Default: 1
+    --disable-compaction
+      Compaction is enabled for MoR table by default. This flag disables it
+      Default: false
+    --enable-hive-sync
+      Enable syncing to hive
+      Default: false
+    --filter-dupes
+      Should duplicate records from source be dropped/filtered out before
+      insert/bulk-insert
+      Default: false
+    --help, -h
+
+    --hoodie-conf
+      Any configuration that can be set in the properties file (using the CLI
+      parameter "--propsFilePath") can also be passed command line using this
+      parameter
+      Default: []
+    --max-pending-compactions
+      Maximum number of outstanding inflight/requested compactions. Delta Sync
+      will not happen unlessoutstanding compactions is less than this number
+      Default: 5
+    --min-sync-interval-seconds
+      the min sync interval of each sync in continuous mode
+      Default: 0
+    --op
+      Takes one of these values : UPSERT (default), INSERT (use when input is
+      purely new data/inserts to gain speed)
+      Default: UPSERT
+      Possible Values: [UPSERT, INSERT, BULK_INSERT]
+    --payload-class
+      subclass of HoodieRecordPayload, that works off a GenericRecord.
+      Implement your own, if you want to do something other than overwriting
+      existing value
+      Default: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      hoodie client, schema provider, key generator and data source. For
+      hoodie client props, sane defaults are used, but recommend use to
+      provide basic things like metrics endpoints, hive configs etc. For
+      sources, referto individual classes, for supported properties.
+      Default: 
file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
+    --schemaprovider-class
+      subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
+      schemas to input & target table data, built in options:
+      org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source (See
+      org.apache.hudi.utilities.sources.Source) implementation can implement
+      their own SchemaProvider. For Sources that return Dataset<Row>, the
+      schema is obtained implicitly. However, this CLI option allows
+      overriding the schemaprovider returned by Source.
+    --source-class
+      Subclass of org.apache.hudi.utilities.sources to read data. Built-in
+      options: org.apache.hudi.utilities.sources.{JsonDFSSource (default), 
+      AvroDFSSource, AvroKafkaSource, CsvDFSSource, HiveIncrPullSource, 
+      JdbcSource, JsonKafkaSource, ORCDFSSource, ParquetDFSSource, 
+      S3EventsHoodieIncrSource, S3EventsSource, SqlSource}
+      Default: org.apache.hudi.utilities.sources.JsonDFSSource
+    --source-limit
+      Maximum amount of data to read from source. Default: No limit For e.g:
+      DFS-Source => max bytes to read, Kafka-Source => max events to read
+      Default: 9223372036854775807
+    --source-ordering-field
+      Field within source record to decide how to break ties between records
+      with same key in input data. Default: 'ts' holding unix timestamp of
+      record
+      Default: ts
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --table-type
+      Type of table. COPY_ON_WRITE (or) MERGE_ON_READ
+  * --target-base-path
+      base path for the target hoodie table. (Will be created if did not exist
+      first time around. If exists, expected to be a hoodie table)
+  * --target-table
+      name of the target table in Hive
+    --transformer-class
+      subclass of org.apache.hudi.utilities.transform.Transformer. Allows
+      transforming raw source Dataset to a target Dataset (conforming to
+      target schema) before writing. Default : Not set. E:g -
+      org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
+      allows a SQL query templated to be passed as a transformation function)
+```
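
The `--source-ordering-field` option above decides which record wins when several input records share a key. The following is only a conceptual sketch of that tie-break, not Hudi's implementation: it assumes that the record with the largest ordering value is kept, and the function and field names (`keep_latest_per_key`, `id`, `value`) are hypothetical.

```python
from typing import Dict, Iterable, List


def keep_latest_per_key(
    records: Iterable[dict], key_field: str, ordering_field: str = "ts"
) -> List[dict]:
    """Keep one record per key, preferring the largest ordering-field value."""
    latest: Dict[object, dict] = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # Assumption: a strictly larger ordering value replaces the kept record.
        if current is None or record[ordering_field] > current[ordering_field]:
            latest[key] = record
    return list(latest.values())


batch = [
    {"id": "a", "ts": 1, "value": "old"},
    {"id": "a", "ts": 5, "value": "new"},
    {"id": "b", "ts": 2, "value": "only"},
]
deduped = keep_latest_per_key(batch, key_field="id")
```

Here `ts` mirrors the default ordering field named in the help text; any comparable field would work the same way in this sketch.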
+
+The tool takes a hierarchically composed property file and has pluggable 
interfaces for extracting data, key generation and providing schema. Sample 
configs for ingesting from Kafka and DFS are
+provided under `hudi-utilities/src/test/resources/delta-streamer-config`.
+
+For example, once you have Confluent Kafka and Schema Registry up and running, 
produce some test data using 
[impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data)
 (provided by the schema-registry repo):
+
+```java
+[confluent-5.0.0]$ bin/ksql-datagen schema=../impressions.avro format=avro 
topic=impressions key=impressionid
+```
+
+and then ingest it as follows.
+
+```java
+[hoodie]$ spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls 
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --props 
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
 \
+  --schemaprovider-class 
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+  --source-ordering-field impressiontime \
+  --target-base-path file:///tmp/hudi-deltastreamer-op \
+  --target-table uber.impressions \
+  --op BULK_INSERT
+```
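
When a job like this is launched from a script, building the argument list programmatically avoids shell pitfalls such as a stray space after a line-continuation backslash. The sketch below uses only flags that appear in the help output above. The function name, the jar path, and the properties path are hypothetical placeholders, and `--table-type` is included because the help output marks it as required.

```python
import shlex
from typing import List


def delta_streamer_command(
    bundle_jar: str,
    props_file: str,
    target_base_path: str,
    target_table: str,
    table_type: str = "COPY_ON_WRITE",
    ordering_field: str = "ts",
    op: str = "UPSERT",
) -> List[str]:
    """Assemble spark-submit arguments for HoodieDeltaStreamer."""
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        bundle_jar,
        "--props", props_file,
        "--schemaprovider-class", "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
        "--source-class", "org.apache.hudi.utilities.sources.AvroKafkaSource",
        "--source-ordering-field", ordering_field,
        "--table-type", table_type,
        "--target-base-path", target_base_path,
        "--target-table", target_table,
        "--op", op,
    ]


command = delta_streamer_command(
    bundle_jar="hudi-utilities-bundle.jar",  # hypothetical; point at the real bundle jar
    props_file="file:///tmp/kafka-source.properties",  # hypothetical location
    target_base_path="file:///tmp/hudi-deltastreamer-op",
    target_table="uber.impressions",
    ordering_field="impressiontime",
    op="BULK_INSERT",
)
# Print a copy-pasteable command line; pass `command` to subprocess.run to launch it.
print(shlex.join(command))
```

Passing the list directly to `subprocess.run` keeps every argument intact without any shell quoting.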
+
+In some cases, you may want to migrate your existing table into Hudi 
beforehand. Please refer to the [migration guide](/docs/migration_guide).
+
+## MultiTableDeltaStreamer
+
+`HoodieMultiTableDeltaStreamer`, a wrapper on top of `HoodieDeltaStreamer`, 
ingests multiple tables into Hudi datasets in a single run. Currently it 
supports only sequential processing of the tables to be ingested and the 
COPY_ON_WRITE storage type. The command line options for 
`HoodieMultiTableDeltaStreamer` are largely the same as those for 
`HoodieDeltaStreamer`, except that you must provide table-wise configs in 
separate files in a dedicated config folder. The [...]
+
+```java
+  * --config-folder
+    the path to the folder which contains all the table wise config files
+    --base-path-prefix
+    this is added to enable users to create all the hudi datasets for related 
tables under one path in FS. The datasets are then created under the path - 
<base_path_prefix>/<database>/<table_to_be_ingested>. However you can override 
the paths for every table by setting the property 
hoodie.deltastreamer.ingestion.targetBasePath
+```
+
+The following properties must be set to ingest data using 
`HoodieMultiTableDeltaStreamer`.
+
+```java
+hoodie.deltastreamer.ingestion.tablesToBeIngested
+  comma separated names of tables to be ingested in the format 
<database>.<table>, for example db1.table1,db1.table2
+hoodie.deltastreamer.ingestion.targetBasePath
+  if you wish to ingest a particular table in a separate path, you can mention 
that path here
+hoodie.deltastreamer.ingestion.<database>.<table>.configFile
+  path to the config file in dedicated config folder which contains table 
overridden properties for the particular table to be ingested.
+```
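
As a rough illustration of how these properties fit together, the sketch below generates the property keys for a list of `<database>.<table>` names and computes the default dataset path `<base_path_prefix>/<database>/<table>` described above. The helper names and the config file naming pattern (`<database>_<table>.properties`) are assumptions made for this example only.

```python
from typing import Dict, List

PREFIX = "hoodie.deltastreamer.ingestion"


def default_target_path(base_path_prefix: str, qualified_table: str) -> str:
    """Return <base_path_prefix>/<database>/<table> for a <database>.<table> name."""
    database, table = qualified_table.split(".", 1)
    return f"{base_path_prefix.rstrip('/')}/{database}/{table}"


def multi_table_properties(tables: List[str], config_folder: str) -> Dict[str, str]:
    """Build the table list property and one configFile property per table."""
    props = {f"{PREFIX}.tablesToBeIngested": ",".join(tables)}
    for qualified in tables:
        database, table = qualified.split(".", 1)
        # Hypothetical file naming; use whatever names exist in your config folder.
        props[f"{PREFIX}.{database}.{table}.configFile"] = (
            f"{config_folder.rstrip('/')}/{database}_{table}.properties"
        )
    return props


props = multi_table_properties(
    ["db1.table1", "db1.table2"], "file:///tmp/hudi-ingestion-config"
)
for key, value in props.items():
    print(f"{key}={value}")
```

A table that sets `hoodie.deltastreamer.ingestion.targetBasePath` in its own config file would override the path computed by `default_target_path`.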
+
+Sample config files for table wise overridden properties can be found under 
`hudi-utilities/src/test/resources/delta-streamer-config`. The command to run 
`HoodieMultiTableDeltaStreamer` is also similar to how you run 
`HoodieDeltaStreamer`.
+
+```java
+[hoodie]$ spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer `ls 
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --props 
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
 \
+  --config-folder file:///tmp/hudi-ingestion-config \
+  --schemaprovider-class 
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+  --source-ordering-field impressiontime \
+  --base-path-prefix file:///tmp/hudi-deltastreamer-op \
+  --target-table uber.impressions \
+  --op BULK_INSERT
+```
+
+For detailed information on how to configure and use 
`HoodieMultiTableDeltaStreamer`, please refer to the [blog 
section](/blog/2020/08/22/ingest-multiple-tables-using-hudi).
+
+## Concurrency Control
+
+Using optimistic_concurrency_control via delta streamer requires adding the 
concurrency control configs (see the concurrency control section linked below) 
to the properties file that is passed to the
+job. For example, adding the configs to the kafka-source.properties file and 
passing it to deltastreamer enables optimistic concurrency control.
+A deltastreamer job can then be triggered as follows:
+
+```java
+[hoodie]$ spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls 
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --props 
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
 \
+  --schemaprovider-class 
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+  --source-ordering-field impressiontime \
+  --target-base-path file:///tmp/hudi-deltastreamer-op \
+  --target-table uber.impressions \
+  --op BULK_INSERT
+```
+
+Read more about concurrency control in the [concurrency control 
concepts](/docs/concurrency_control) section.
diff --git a/website/versioned_docs/version-0.9.0/query_engine_setup.md 
b/website/versioned_docs/version-0.9.0/query_engine_setup.md
new file mode 100644
index 0000000..99887b5
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/query_engine_setup.md
@@ -0,0 +1,46 @@
+---
+title: Query Engine Setup
+summary: "In this page, we describe how to set up various query engines for 
Hudi."
+toc: true
+last_modified_at:
+---
+
+## Spark
+The Spark Datasource API is a popular way of authoring Spark ETL pipelines. 
Hudi tables can be queried via the Spark datasource with a simple 
`spark.read.parquet`.
+See the [Spark Quick Start](/docs/quick-start-guide) for more examples of 
Spark datasource reading queries.
+
+If your Spark environment does not have the Hudi jars installed, add `--jars 
<path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to the classpath of 
drivers
+and executors. Alternatively, hudi-spark-bundle can also be fetched via the 
`--packages` option (e.g. `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.3`).
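
The two naming patterns above differ only in their separators, which is easy to get wrong by hand. As a small sketch, the hypothetical helpers below derive both the `--packages` coordinate and the `--jars` file name from a Scala version and a Hudi version. The default of `2.11` simply mirrors the example above.

```python
def hudi_spark_bundle_coordinate(hudi_version: str, scala_version: str = "2.11") -> str:
    """Maven coordinate suitable for spark-submit --packages."""
    return f"org.apache.hudi:hudi-spark-bundle_{scala_version}:{hudi_version}"


def hudi_spark_bundle_jar(hudi_version: str, scala_version: str = "2.11") -> str:
    """Jar file name suitable for spark-submit --jars (prepend your own directory)."""
    return f"hudi-spark-bundle_{scala_version}-{hudi_version}.jar"


coordinate = hudi_spark_bundle_coordinate("0.5.3")
jar_name = hudi_spark_bundle_jar("0.5.3")
```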
+
+## PrestoDB
+PrestoDB is a popular query engine, providing interactive query performance. 
PrestoDB currently supports snapshot querying on COPY_ON_WRITE tables.
+Both snapshot and read optimized queries are supported on MERGE_ON_READ Hudi 
tables. Since the PrestoDB-Hudi integration has evolved over time, the installation
+instructions for PrestoDB vary by version. See the table below for the query 
types supported and the installation instructions
+for different versions of PrestoDB.
+
+
+| **PrestoDB Version** | **Installation description** | **Query types supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233              | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| >= 0.233             | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| >= 0.240             | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |
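
The version thresholds in the table can be encoded as a small lookup, which is handy when a deployment script has to decide whether the bundle jar must be copied. This is a sketch only: the function names and the returned dictionary layout are made up for illustration, and the version parsing assumes purely numeric dotted versions such as `0.240` or `0.240.1`.

```python
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def presto_hudi_support(presto_version: str) -> Dict[str, object]:
    """Map a PrestoDB version to the install step and query types from the table."""
    version = parse_version(presto_version)
    if version >= (0, 240):
        return {"needs_bundle_jar": False, "COW": "snapshot", "MOR": "snapshot"}
    if version >= (0, 233):
        return {"needs_bundle_jar": False, "COW": "snapshot", "MOR": "read optimized"}
    # Older versions need hudi-presto-bundle under <presto_install>/plugin/hive-hadoop2/.
    return {"needs_bundle_jar": True, "COW": "snapshot", "MOR": "read optimized"}


support = presto_hudi_support("0.235")
```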
+
+## Trino
+:::note
+[Trino](https://trino.io/) (formerly PrestoSQL) was forked off of PrestoDB a 
few years ago. Hudi supports 'Snapshot' queries for Copy-On-Write tables and 
'Read Optimized' queries
+for Merge-On-Read tables. This is through the initial input format based 
integration in PrestoDB (pre forking). This approach has
+known performance limitations with very large tables, which have since been 
fixed in PrestoDB. We are working on replicating the same fixes in Trino as 
well.
+:::
+
+To query Hudi tables on Trino, please place the `hudi-presto-bundle` jar into 
the Hive connector installation `<trino_install>/plugin/hive-hadoop2`.
+
+## Hive
+
+In order for Hive to recognize Hudi tables and query them correctly,
+- HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars 
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf#concept_nc3_mms_lr).
 This ensures the input format
+  classes and their dependencies are available for query planning & execution.
+- For MERGE_ON_READ tables, the bundle additionally needs to be put on the 
hadoop/hive installation across the cluster, so that queries can pick up the 
custom RecordReader as well.
+
+In addition to the setup above, for beeline CLI access, the `hive.input.format` 
variable needs to be set to the fully qualified class name of the
+inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, 
additionally the `hive.tez.input.format` needs to be set
+to `org.apache.hadoop.hive.ql.io.HiveInputFormat`. Then proceed to query the 
table like any other Hive table.
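
The session settings described above can be generated as `set` statements and run at the start of a beeline session. The helper below is a minimal sketch with a hypothetical name. It only formats the two property assignments named in the text and does not talk to Hive itself.

```python
from typing import List


def hive_session_settings(use_tez: bool = False) -> List[str]:
    """Return the `set` statements to run before querying a Hudi table."""
    statements = [
        "set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;"
    ]
    if use_tez:
        # Tez additionally needs its own input format property.
        statements.append(
            "set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;"
        )
    return statements


settings = hive_session_settings(use_tez=True)
print("\n".join(settings))
```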
diff --git a/website/versioned_docs/version-0.9.0/table_types.md 
b/website/versioned_docs/version-0.9.0/table_types.md
new file mode 100644
index 0000000..1f0b767
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/table_types.md
@@ -0,0 +1,7 @@
+---
+title: Table Types
+summary: "In this page, we describe the different table types in Hudi."
+toc: true
+last_modified_at:
+---
+
