This is an automated email from the ASF dual-hosted git repository.

mcasters pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hop.git


The following commit(s) were added to refs/heads/master by this push:
     new 25a30e4  HOP-3133 : Documentation: add Getting started with Beam guide
     new 17f903e  Merge pull request #967 from mattcasters/master
25a30e4 is described below

commit 25a30e49a6c62567792c5a7dc4f2a48a7d6e44ba
Author: Matt Casters <[email protected]>
AuthorDate: Tue Jul 27 14:25:22 2021 +0200

    HOP-3133 : Documentation: add Getting started with Beam guide
---
 .../beam-getting-started-beam-file-definition.png  | Bin 0 -> 145198 bytes
 .../images/beam-getting-started-flushes-metric.png | Bin 0 -> 98819 bytes
 ...etting-started-input-transforms-on-dataflow.png | Bin 0 -> 71602 bytes
 ...getting-started-input-process-output-sample.png | Bin 0 -> 57642 bytes
 .../pipeline/beam/getting-started-with-beam.adoc   | 179 +++++++++++++++++++++
 .../beam-dataflow-pipeline-engine.adoc             |  63 +++++---
 .../modules/ROOT/pages/pipeline/pipelines.adoc     |   1 +
 7 files changed, 224 insertions(+), 19 deletions(-)

diff --git 
a/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-beam-file-definition.png
 
b/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-beam-file-definition.png
new file mode 100644
index 0000000..28b4d41
Binary files /dev/null and 
b/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-beam-file-definition.png
 differ
diff --git 
a/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-flushes-metric.png
 
b/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-flushes-metric.png
new file mode 100644
index 0000000..9e3002d
Binary files /dev/null and 
b/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-flushes-metric.png
 differ
diff --git 
a/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-input-transforms-on-dataflow.png
 
b/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-input-transforms-on-dataflow.png
new file mode 100644
index 0000000..975ba76
Binary files /dev/null and 
b/docs/hop-user-manual/modules/ROOT/assets/images/beam-getting-started-input-transforms-on-dataflow.png
 differ
diff --git 
a/docs/hop-user-manual/modules/ROOT/assets/images/getting-started-input-process-output-sample.png
 
b/docs/hop-user-manual/modules/ROOT/assets/images/getting-started-input-process-output-sample.png
new file mode 100644
index 0000000..f70f0d5
Binary files /dev/null and 
b/docs/hop-user-manual/modules/ROOT/assets/images/getting-started-input-process-output-sample.png
 differ
diff --git 
a/docs/hop-user-manual/modules/ROOT/pages/pipeline/beam/getting-started-with-beam.adoc
 
b/docs/hop-user-manual/modules/ROOT/pages/pipeline/beam/getting-started-with-beam.adoc
new file mode 100644
index 0000000..9cf3649
--- /dev/null
+++ 
b/docs/hop-user-manual/modules/ROOT/pages/pipeline/beam/getting-started-with-beam.adoc
@@ -0,0 +1,179 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+////
+[[GettingStartedWithBeam]]
+:imagesdir: ../assets/images
+
+= Getting started with Apache Beam
+
+== What is Apache Beam?
+
+https://beam.apache.org[Apache Beam] is an advanced unified programming model 
that allows you to implement batch and streaming data processing jobs that run 
on any execution engine.
+Popular execution engines include, for example, https://spark.apache.org[Apache Spark], https://flink.apache.org[Apache Flink] and https://cloud.google.com/dataflow[Google Cloud Platform Dataflow].
+
+== How does it work?
+
+Apache Beam allows you to create programs in a variety of programming 
languages like Java, Python and Go using a standard 
https://beam.apache.org/documentation/programming-guide/[Beam API].
+These programs build data 
https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline[pipelines]
 which can then be executed using Beam 
https://beam.apache.org/documentation/runners/capability-matrix/[runners] on 
the various execution engines.
+
+== How is Hop using Beam?
+
+Hop uses the Beam API to create Beam pipelines based on your visually designed Hop pipelines.
+The terminology of Hop and Beam is aligned because the terms mean the same thing.
+Hop provides 4 standard ways to execute a pipeline that you designed: on Spark, Flink, Dataflow or on the Direct runner.
+
+Here is the documentation for the relevant plugins:
+
+* 
xref:pipeline/pipeline-run-configurations/beam-spark-pipeline-engine.adoc[Beam 
Spark pipeline engine]
+* 
xref:pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.adoc[Beam 
Flink pipeline engine]
+* 
xref:pipeline/pipeline-run-configurations/beam-dataflow-pipeline-engine.adoc[Beam
 Dataflow pipeline engine]
+* 
xref:pipeline/pipeline-run-configurations/beam-direct-pipeline-engine.adoc[Beam 
Direct pipeline engine]
+
+== How are my pipelines executed?
+
+An Apache Hop pipeline is just metadata.
+The various Beam pipeline engine plugins look at this metadata one transform at a time.
+Each engine decides what to do with a transform based on the Hop transform handler that is provided.
+The handlers are in general split into different types:
+
+=== Beam specific transforms
+
+There are a number of Beam specific transforms available which only work on 
the provided Beam pipeline execution engines.
+For example: xref:pipeline/transforms/beaminput.adoc[Beam Input] which reads 
text file data from one or more files or 
xref:pipeline/transforms/beambigqueryoutput.adoc[Beam BigQuery Output] which 
writes data to BigQuery.
+
+You can find these transforms in the `Big Data` category; their names all start with `Beam` to make it easy to recognize them.
+
+Here is an example of a simple pipeline which reads files in a folder (on `gs://`), filters out data from California, removes and renames a few fields and writes the data back to another set of files:
+
+image::getting-started-input-process-output-sample.png[Beam 
input-process-output-sample]
+
+=== Universal transforms
+
+There are a few transforms which are translated into Beam variations:
+
+* xref:pipeline/transforms/memgroupby.adoc[Memory Group By]: This transform 
allows you to aggregate data across large data volumes.
+When using the Beam engines it uses 
`org.apache.beam.sdk.transforms.GroupByKey`.
+* xref:pipeline/transforms/mergejoin.adoc[Merge Join]: You can join 2 data 
sources with this transform.
+The main difference is that in the Beam engines the input data doesn't need to 
be sorted.
+The Beam class used to perform this is: 
`org.apache.beam.sdk.extensions.joinlibrary.Join`.
+* xref:pipeline/transforms/rowgenerator.adoc[Generate Rows]: This transform is 
used to generate (empty/static) rows of data.
+It can be either a fixed number, or it can generate rows indefinitely.
+When using the Beam engines it uses 
`org.apache.beam.sdk.io.synthetic.SyntheticBoundedSource` or 
`org.apache.beam.sdk.io.synthetic.SyntheticUnboundedSource`.
+
+=== Unsupported transforms
+
+A few transforms are simply not supported because we haven't found a good way 
to do this on Beam yet:
+
+* xref:pipeline/transforms/uniquerows.adoc[Unique Rows]
+* xref:pipeline/transforms/groupby.adoc[Group By]: Use the `Memory Group By` transform instead
+* xref:pipeline/transforms/sort.adoc[Sort Rows]
+
+=== All other transforms
+
+All other transforms are simply supported.
+They are wrapped in a bit of code to make the exact same code that runs on the 
Hop local pipeline engine work in a Beam pipeline.
+There are a few things to mention though.
+
+|===
+|Special case |Solution
+
+|Info transforms
+|Some transforms like `Stream Lookup` read data from other transforms.
+This is handled by https://beam.apache.org/documentation/patterns/side-inputs/[side-inputs] in the Beam API and is as such fully supported.
+
+|Target transforms
+|Sometimes you want to target specific transforms like in `Switch Case` or 
`Filter Rows`.
+This is fully supported as well, using the Beam API's https://beam.apache.org/documentation/programming-guide/#additional-outputs[additional outputs].
+
+|Non-Beam input transforms
+|When you're reading data using a non-Beam transform (see `Beam specific transforms` above) we need to make sure that this transform is executed in exactly one thread.
+Otherwise, your XML or JSON document might be read many times because of the inherently parallel nature of the various engines.
+This is handled by doing a Group By over a single value.
+You'll see the following in for example your Dataflow pipeline: 
`Create.Values` -> `WithKeys` -> `GroupByKey` -> `Values` -> 
`Flatten.Iterables` -> `ParDo` ... and all this is just done to make sure we 
only ever execute our transform once.
+
+image:beam-getting-started-input-transforms-on-dataflow.png[Non-Beam input 
transforms on Dataflow]
+
+|Non-Beam Output transforms
+|The insistence of a Beam pipeline to run work in parallel can also trip you 
up on the output side.
+In some cases you may not want a server to be bombarded by dozens of inbound connections.
+To limit the number of output copies you can include *`SINGLE_BEAM`* in the number of copies value of a transform (click on the transform and select `Number of copies` in the Hop GUI).
+This performs a GroupBy over all records and then iterates over them in a single thread.
+
+|Row batching with non-Beam transforms
+|A lot of target databases like to receive rows in batches of records.
+So if you have a transform like for example `Table Output` or `Neo4j Output` 
you might see that performance is not that great.
+This is because by default the https://beam.apache.org/documentation/runtime/model/[Beam programming model] is designed to stream rows of data through a pipeline in `bundles`, while the Hop API only knows about a single record at a time.
+For these transforms you can include *`BATCH`* in the number of copies string of a transform (click on the transform and select `Number of copies` in the Hop GUI).
+For these flagged transforms you can then specify 2 parameters in your Beam 
pipeline run configurations.
+When you set these you can determine how long rows are held back before being forced to the transforms in question.
+
+*Streaming Hop transform flush interval*: how long are rows kept and batched up?
+If you care about latency make this lower (500 or lower).
+If you have a long-running batching pipeline, make it higher (10000 or higher 
perhaps).
+
+*Hop streaming transforms buffer size*: how many rows are being batched?
+Consider making it the same as the batching size you use in your transform metadata (e.g. `Table Output`, `Neo4j Cypher`, ...).
+
+Please note that these are maximum values.
+If the end of a bundle is reached in a pipeline, rows are always forced to the transform code and as such pushed to the target system.
+To get an idea of how many times a batching buffer is flushed to the 
underlying transform code (and as such to for example a remote database) we 
added a `Flushes` metric.
+You will notice this in your metrics view in the Hop GUI when executing.
+
+image:beam-getting-started-flushes-metric.png[Beam Flushes Metrics]
+
+|===
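The single-thread trick in the `Non-Beam input transforms` row above can be illustrated without the Beam SDK. This is a conceptual sketch in plain Java (the class and method names are made up, not Hop or Beam code): grouping every element under one constant key yields exactly one group, so whatever consumes that group necessarily runs as a single unit of work, which is what the `Create.Values` -> `WithKeys` -> `GroupByKey` -> `Values` chain achieves on Beam.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual sketch in plain Java, not Beam SDK code: grouping every element
// under one constant key yields exactly one group, so whatever consumes that
// group necessarily runs as a single unit of work.
public class SingleKeyGrouping {

    public static Map<String, List<String>> groupUnderSingleKey(List<String> inputs) {
        // Every element maps to the same (made-up) key "SINGLE".
        return inputs.stream().collect(Collectors.groupingBy(x -> "SINGLE"));
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped =
            groupUnderSingleKey(List.of("file-1.xml", "file-2.xml", "file-3.xml"));
        // One key, one group: a downstream consumer of the grouped values
        // sees all inputs in a single call.
        System.out.println(grouped);
    }
}
```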
+
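The interplay of the two row-batching parameters above can be sketched in plain Java. This is a minimal, hypothetical illustration (the `RowBuffer` class and all its details are assumptions, not the actual Hop implementation): rows are flushed when either the buffer size or the flush interval is exceeded, and a final flush at the end of a bundle pushes out whatever remains, which is why both settings are maximum values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the actual Hop implementation: a row buffer that
// flushes to a target when it reaches a maximum size ("Hop streaming
// transforms buffer size") or when the flush interval has elapsed
// ("Streaming Hop transform flush interval").
public class RowBuffer {

    private final int maxRows;
    private final long flushIntervalMs;
    private final Consumer<List<Object[]>> target;
    private final List<Object[]> rows = new ArrayList<>();
    private long lastFlush = System.currentTimeMillis();
    private int flushes = 0; // corresponds to the "Flushes" metric

    public RowBuffer(int maxRows, long flushIntervalMs, Consumer<List<Object[]>> target) {
        this.maxRows = maxRows;
        this.flushIntervalMs = flushIntervalMs;
        this.target = target;
    }

    public void addRow(Object[] row) {
        rows.add(row);
        boolean full = rows.size() >= maxRows;
        boolean expired = System.currentTimeMillis() - lastFlush >= flushIntervalMs;
        if (full || expired) {
            flush();
        }
    }

    // Called unconditionally at the end of a bundle: remaining rows are always
    // pushed to the transform, which is why both settings are maximum values.
    public void flush() {
        if (!rows.isEmpty()) {
            target.accept(new ArrayList<>(rows));
            rows.clear();
            flushes++;
        }
        lastFlush = System.currentTimeMillis();
    }

    public int getFlushes() {
        return flushes;
    }

    public static void main(String[] args) {
        RowBuffer buffer = new RowBuffer(2, 60_000, batch ->
            System.out.println("flushing " + batch.size() + " row(s)"));
        buffer.addRow(new Object[] {"a"});
        buffer.addRow(new Object[] {"b"}); // second row fills the buffer: flush
        buffer.flush();                    // end of bundle: nothing left, no-op
        System.out.println("flushes: " + buffer.getFlushes());
    }
}
```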
+== Fat jars?
+
+A fat jar is often used to package up all the code you need for a particular 
project.
+The Spark, Flink and Dataflow execution engines like it since it massively 
simplifies the Java classpath when executing pipelines.
+Apache Hop allows you to create a fat jar in the Hop GUI with the 
`Tools/Generate a Hop fat jar...` menu or using the following command:
+
+[source]
+----
+sh hop-config.sh -fj /path/to/fat.jar
+----
+
+The path to this fat jar can then be referenced in the various Beam runtime 
configurations.
+Note that the current version of Hop and all its plugins are used to build the 
fat jar.
+If you install or remove plugins, or update Hop itself, remember to generate a new fat jar.
+
+== Beam File definitions
+
+The xref:pipeline/transforms/beaminput.adoc[Beam Input] and 
xref:pipeline/transforms/beamoutput.adoc[Beam Output] transforms expect you to 
define the layout of the file(s) being read or written.
+
+image::beam-getting-started-beam-file-definition.png[Beam File Definition 
example]
+
+== Current limitations
+
+There are some specific advantages to using engines like Spark, Flink and 
Dataflow.
+However, some limitations come with them as well:
+
+* Previewing data is not available (yet).
+Because of the distributed nature of execution we don't have a great way to 
acquire preview data.
+* Unit testing: not available, for reasons similar to previewing and debugging.
+To test your Beam pipelines, pick up data after a pipeline is done and compare that to a golden data set in another pipeline running with a "Local Hop" pipeline engine.
+* Debugging or pausing a pipeline is not supported.
+
diff --git 
a/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipeline-run-configurations/beam-dataflow-pipeline-engine.adoc
 
b/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipeline-run-configurations/beam-dataflow-pipeline-engine.adoc
index ffb0cd2..6c6b9b6 100644
--- 
a/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipeline-run-configurations/beam-dataflow-pipeline-engine.adoc
+++ 
b/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipeline-run-configurations/beam-dataflow-pipeline-engine.adoc
@@ -39,59 +39,84 @@ To use the Google Cloud Dataflow runtime configuration, you 
must complete the se
 
 * Select or create a Google Cloud Platform Console project.
 * Enable billing for your project.
-* Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, 
Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource 
Manager. You may need to enable additional APIs (such as BigQuery, Cloud 
Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.
+* Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, 
Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource 
Manager.
+You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or 
Cloud Datastore) if you use them in your pipeline code.
 * Authenticate with Google Cloud Platform.
 * Install the Google Cloud SDK.
 * Create a Cloud Storage bucket.
 
-
 == Options
 
-[width="90%", options="header", cols="1, 3"]
+[width="90%",options="header",cols="1, 3"]
 |===
 |Option|Description
-|Project ID|   The project ID for your Google Cloud Project. This is required 
if you want to run your pipeline using the Dataflow managed service.
-|Application name|The name of the Dataflow job being executed as it appears in 
Dataflow's jobs list and job details. Also used when updating an existing 
pipeline.
-|Staging location|Cloud Storage path for staging local files. Must be a valid 
Cloud Storage URL, beginning with gs://.
-|Initial number of workers|The initial number of Google Compute Engine 
instances to use when executing your pipeline. This option determines how many 
workers the Dataflow service starts up when your job begins.
-|Maximum number of workers|The maximum number of Compute Engine instances to 
be made available to your pipeline during execution. Note that this can be 
higher than the initial number of workers (specified by numWorkers to allow 
your job to scale up, automatically or otherwise.
-|Auto-scaling algorithm a|The autoscaling mode for your Dataflow job. Possible 
values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See 
https://cloud.google.com/dataflow/service/dataflow-service-desc#Autotuning[Autotuning
 features] to learn more about how autoscaling works in the Dataflow managed 
service.
+|Project ID|    The project ID for your Google Cloud Project.
+This is required if you want to run your pipeline using the Dataflow managed 
service.
+|Application name|The name of the Dataflow job being executed as it appears in 
Dataflow's jobs list and job details.
+Also used when updating an existing pipeline.
+|Staging location|Cloud Storage path for staging local files.
+Must be a valid Cloud Storage URL, beginning with gs://.
+|Initial number of workers|The initial number of Google Compute Engine 
instances to use when executing your pipeline.
+This option determines how many workers the Dataflow service starts up when 
your job begins.
+|Maximum number of workers|The maximum number of Compute Engine instances to be made available to your pipeline during execution.
+Note that this can be higher than the initial number of workers (specified by numWorkers) to allow your job to scale up, automatically or otherwise.
+|Auto-scaling algorithm a|The autoscaling mode for your Dataflow job.
+Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable.
+See 
https://cloud.google.com/dataflow/service/dataflow-service-desc#Autotuning[Autotuning
 features] to learn more about how autoscaling works in the Dataflow managed 
service.
 |Worker machine type|
-The Compute Engine machine type that Dataflow uses when starting worker VMs. 
You can use any of the available Compute Engine machine type families as well 
as custom machine types.
+The Compute Engine machine type that Dataflow uses when starting worker VMs.
+You can use any of the available Compute Engine machine type families as well 
as custom machine types.
 
-For best results, use n1 machine types. Shared core machine types, such as f1 
and g1 series workers, are not supported under the Dataflow Service Level 
Agreement.
+For best results, use n1 machine types.
+Shared core machine types, such as f1 and g1 series workers, are not supported 
under the Dataflow Service Level Agreement.
 
-Note that Dataflow bills by the number of vCPUs and GB of memory in workers. 
Billing is independent of the machine type family. Check the 
link:https://cloud.google.com/compute/docs/machine-types[list] of machine types 
for reference.
+Note that Dataflow bills by the number of vCPUs and GB of memory in workers.
+Billing is independent of the machine type family.
+Check the link:https://cloud.google.com/compute/docs/machine-types[list] of 
machine types for reference.
 |Worker disk type|The type of persistent disk to use, specified by a full URL 
of the disk type resource.
 
 For example, use compute.googleapis.com/projects//zones//diskTypes/pd-ssd to 
specify a SSD persistent disk.
 
 https://cloud.google.com/compute/docs/disks#pdspecs[more].
-|Disk size in GB|The disk size, in gigabytes, to use on each remote Compute 
Engine worker instance. If set, specify at least 30 GB to account for the 
worker boot image and local logs.
-|Region|Specifies a Compute Engine region for launching worker instances to 
run your pipeline. This option is used to run workers in a different location 
than the region used to deploy, manage, and monitor jobs. The zone for 
workerRegion is 
https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#autozone[automatically
 assigned].
+|Disk size in GB|The disk size, in gigabytes, to use on each remote Compute 
Engine worker instance.
+If set, specify at least 30 GB to account for the worker boot image and local 
logs.
+|Region|Specifies a Compute Engine region for launching worker instances to 
run your pipeline.
+This option is used to run workers in a different location than the region 
used to deploy, manage, and monitor jobs.
+The zone for workerRegion is 
https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#autozone[automatically
 assigned].
 
 Note: This option cannot be combined with workerZone or zone.
 
 (https://cloud.google.com/dataflow/docs/concepts/regional-endpoints[regions 
list]).
-|Zone|Specifies a Compute Engine zone for launching worker instances to run 
your pipeline. This option is used to run workers in a different location than 
the region used to deploy, manage, and monitor jobs.
+|Zone|Specifies a Compute Engine zone for launching worker instances to run 
your pipeline.
+This option is used to run workers in a different location than the region 
used to deploy, manage, and monitor jobs.
 
 Note: This option cannot be combined with workerRegion or zone.
 |User agent|A user agent string as per 
https://tools.ietf.org/html/rfc2616[RFC2616], describing the pipeline to 
external services.
-|Temp location|Cloud Storage path for temporary files. Must be a valid Cloud 
Storage URL, beginning with gs://.
+|Temp location|Cloud Storage path for temporary files.
+Must be a valid Cloud Storage URL, beginning with gs://.
 |Plugins to stage (, delimited)|Comma separated list of plugins.
 |Transform plugin classes|List of transform plugin classes.
 |XP plugin classes|List of extensions point plugins.
 |Streaming Hop transforms flush interval (ms)|The amount of time after which 
the internal buffer is sent completely over the network and emptied.
 |Hop streaming transforms buffer size|The internal buffer size to use.
-|Fat jar file location|Fat jar location. Generate a fat jar using `Tools -> 
Generate a Hop fat jar`. The generated fat jar file name will be copied to the 
clipboard.
+|Fat jar file location|Fat jar location.
+Generate a fat jar using `Tools -> Generate a Hop fat jar`.
+The generated fat jar file name will be copied to the clipboard.
 |===
 
-
 **Environment Settings**
 
 This environment variable needs to be set locally.
 
-[source, bash]
+[source,bash]
 ----
 GOOGLE_APPLICATION_CREDENTIALS=/path/to/google-key.json
 ----
+
+== Security considerations
+
+To allow encrypted (TLS) network connections to, for example, Kafka and Neo4j Aura, certain older security algorithms are https://github.com/apache/incubator-hop/blob/master/plugins/engines/beam/src/main/java/org/apache/hop/beam/engines/dataflow/DataFlowJvmStart.java[disabled on Dataflow].
+This is done by setting the security property `jdk.tls.disabledAlgorithms` to the value `SSLv3, RC4, DES, MD5withRSA, DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL`.
+
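As a sketch, the same kind of JVM-wide restriction can be applied through the standard `java.security.Security` API. The class name below is illustrative (Hop performs this in the linked `DataFlowJvmStart` class), and the property value is the one quoted above:

```java
import java.security.Security;

// Illustrative sketch: disable older TLS algorithms JVM-wide, as the Hop
// Dataflow engine plugin does at startup. The class name is made up; the
// property value is the one quoted in this guide.
public class DisableLegacyTls {

    public static void apply() {
        Security.setProperty(
            "jdk.tls.disabledAlgorithms",
            "SSLv3, RC4, DES, MD5withRSA, DH keySize < 1024, EC keySize < 224, "
                + "3DES_EDE_CBC, anon, NULL");
    }

    public static void main(String[] args) {
        apply();
        // Print the effective value so the restriction can be verified.
        System.out.println(Security.getProperty("jdk.tls.disabledAlgorithms"));
    }
}
```

Note that security properties of this kind are typically read once by the TLS stack, so they have to be set before the first encrypted connection is opened.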
+If you need to make this configurable, create a JIRA case to let us know and we'll look for a way to not hardcode it.
diff --git a/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipelines.adoc 
b/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipelines.adoc
index 59854e3..7ae704e 100644
--- a/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipelines.adoc
+++ b/docs/hop-user-manual/modules/ROOT/pages/pipeline/pipelines.adoc
@@ -25,4 +25,5 @@ under the License.
 * 
xref:pipeline/pipeline-run-configurations/pipeline-run-configurations.adoc[Pipeline
 Run Configurations]
 * xref:pipeline/metadata-injection.adoc[Metadata Injection]
 * xref:pipeline/partitioning.adoc[Partitioning]
+* xref:pipeline/beam/getting-started-with-beam.adoc[Getting started with 
Apache Beam]
 * xref:pipeline/transforms.adoc[Transforms]
