This is an automated email from the ASF dual-hosted git repository.
hansva pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hop.git
The following commit(s) were added to refs/heads/master by this push:
new 928174e HOP-2406 : Document Hop user best practices
new fcb2339 Merge pull request #744 from mattcasters/master
928174e is described below
commit 928174e2e0ed6000105e342472292ee272526214
Author: Matt Casters <[email protected]>
AuthorDate: Thu Apr 15 15:53:11 2021 +0200
HOP-2406 : Document Hop user best practices
---
.../ROOT/assets/images/best-practices-naming.png | Bin 0 -> 114556 bytes
.../ROOT/pages/best-practices/best-practices.adoc | 113 +++++++++++++++++++++
.../ROOT/pages/getting-started/hop-next-steps.adoc | 3 +-
3 files changed, 115 insertions(+), 1 deletion(-)
diff --git
a/docs/hop-user-manual/modules/ROOT/assets/images/best-practices-naming.png
b/docs/hop-user-manual/modules/ROOT/assets/images/best-practices-naming.png
new file mode 100644
index 0000000..1c32e47
Binary files /dev/null and
b/docs/hop-user-manual/modules/ROOT/assets/images/best-practices-naming.png
differ
diff --git
a/docs/hop-user-manual/modules/ROOT/pages/best-practices/best-practices.adoc
b/docs/hop-user-manual/modules/ROOT/pages/best-practices/best-practices.adoc
new file mode 100644
index 0000000..9dd7746
--- /dev/null
+++ b/docs/hop-user-manual/modules/ROOT/pages/best-practices/best-practices.adoc
@@ -0,0 +1,113 @@
+[[BestPractices]]
+:imagesdir: ../assets/images
+
+= Best practices
+
+== Introduction
+
+Apache Hop gives you a large amount of freedom when deciding how to do the
things you want to do.
+This freedom means you can be creative and productive in arriving at the
desired outcome.
+So please consider the advice given on this page as tips or free advice to be
taken or rejected for a particular situation. Only you can decide what the
advice is worth.
+
+== Naming
+
+=== Transforms and actions
+
+When looking at a pipeline or workflow, it's much easier to see what's going
on if you give meaningful names to the transforms and actions.
+
+image::best-practices-naming.png[Showing the differences when giving
transforms a descriptive name]
+
+For input and output transforms, why not use the name of the file they read or write?
+
+TIP: You can use any Unicode character in the name of a transform or action,
and even newlines are allowed.
+
+=== Metadata
+
+When giving names to metadata objects like relational database connections,
avoid environment-specific words such as a country, a region or a lifecycle environment.
+
+Here are some examples to avoid and possible alternatives:
+
+* Test Database -> CRM
+* France MySQL -> WWW
+* East Coast Cluster -> Fraud
+
+=== Naming standard
+
+Larger projects can become quite cluttered after a while.
+Here is some advice on how to keep things tidy and clean...
+
+* Folders can have sub-folders: organize your work with sub-folders in a
project. It's just hard to work with hundreds of pipelines and workflows in a
single folder.
+* Use a naming convention for any object that you need to give a name:
pipelines, workflows, folders, names, tables, fields, ...
+* For larger projects you should consider setting up a naming standard. It's
just a document in which you specify how you want to name things.
+* Make sure to adjust, verify and enforce your naming standard periodically.
If you don't plan to do this you might as well not set up a corporate standard.
+* Consider scripts and commit hooks or a nightly run on your build server
(Jenkins) to validate your naming standard.
+
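As an illustration of the last bullet, here is a minimal sketch of a check that a commit hook or nightly build could run. The convention it enforces (lowercase words separated by hyphens) and the function name `find_violations` are hypothetical; the only Hop-specific assumption is that pipelines and workflows are stored as `.hpl` and `.hwf` files.

```python
import re
from pathlib import PurePosixPath

# Hypothetical convention: lowercase words separated by hyphens,
# for example "load-crm-customers.hpl". Adapt this to your own standard.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

# Hop pipeline and workflow file extensions.
HOP_EXTENSIONS = {".hpl", ".hwf"}


def find_violations(paths):
    """Return the Hop files whose base name breaks the naming convention."""
    violations = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix not in HOP_EXTENSIONS:
            continue  # not a pipeline or workflow, so not checked here
        if not NAME_PATTERN.match(path.stem):
            violations.append(raw)
    return violations


changed_files = [
    "crm/load-customers.hpl",
    "crm/Load Customers.hpl",
    "README.md",
    "nightly_run.hwf",
]
for name in find_violations(changed_files):
    print(f"naming standard violation: {name}")
```

A hook would pass in the list of changed files and fail the commit or build when the returned list is not empty.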
+== Size matters
+
+Limit the number of transforms or actions!
+
+* Larger pipelines or workflows become harder to debug and develop against.
+* For every transform you add to a pipeline you start at least one new thread
at runtime. You could be slowing the pipeline down significantly simply by having hundreds
of threads for hundreds of transforms.
+
+If you find that you need to split up a pipeline you can write intermediate
data to a temporary file using the
xref:pipeline/transforms/cubeoutput.adoc[Serialize to file] transform. The
next pipeline in a workflow can then pick up the data again with the
xref:pipeline/transforms/cubeinput.adoc[De-serialize from file] transform.
While obviously you can also use a database or use another file type to do the
same, these transforms will perform the fastest.
+
+== Variables
+
+xref:variables.adoc[Variables] provide an easy way to avoid hard-coding all
sorts of things in your system, environment or project. Here is some
best-practice advice on the subject:
+
+* Put environment-specific settings in an environment (Duh!) configuration
file. Create an environment for this.
+* When referencing file locations, prefer `${PROJECT_HOME}` over expressions
like `${Internal.Entry.Current.Directory}` or
`${Internal.Pipeline.Filename.Directory}`.
+* Configure transform copies with variables to allow for easy transition
between differently sized environments.
+
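To see why variables make a project portable, consider this small sketch of `${NAME}` substitution. It is an illustration only, not Hop's actual resolver: the `resolve` function, the `LOAD_COPIES` variable and the directory values are all hypothetical, and leaving unknown names untouched is a choice made for this sketch.

```python
import re

# Matches expressions of the form ${NAME}.
VARIABLE = re.compile(r"\$\{([^}]+)\}")


def resolve(expression, variables):
    """Replace each ${NAME} with its value; leave unknown names untouched."""
    return VARIABLE.sub(
        lambda match: variables.get(match.group(1), match.group(0)),
        expression,
    )


# Hypothetical settings for two lifecycle environments.
development = {"PROJECT_HOME": "/home/dev/projects/crm", "LOAD_COPIES": "2"}
production = {"PROJECT_HOME": "/opt/hop/projects/crm", "LOAD_COPIES": "8"}

path = "${PROJECT_HOME}/input/customers.csv"
print(resolve(path, development))
print(resolve(path, production))
```

The same expression yields a different location and a different number of copies in each environment, so the pipeline itself never has to change.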
+== Logging
+
+Take some time to capture the logging of your workflows and pipelines. Any
time you run anything you want to have a trace of it. Things tend to go wrong
when you least expect it, and at that point you will want to see what
happened. See xref:logging/logging-basics.adoc[Logging Basics],
xref:logging/logging-reflection.adoc[Logging Reflection] or consider logging to
a xref:technology/neo4j/neo4j-info.adoc[Neo4j] graph database. This last one
allows you to browse the logging result [...]
+
+== Mappings
+
+If you have recurring logic in various pipelines, consider writing a mapping.
It is a pipeline reading from a
xref:pipeline/transforms/mapping-input.adoc[Mapping Input] and writing to a
xref:pipeline/transforms/mapping-output.adoc[Mapping Output] transform. You
can re-use the work in other pipelines using the
xref:pipeline/transforms/simple-mapping.adoc[Simple Mapping] transform.
+
+== Metadata Injection
+
+If you find that you need to create 'almost' the same pipeline a lot of times,
consider that you can use xref:pipeline/transforms/metainject.adoc[Metadata
Injection] to create re-usable template pipelines.
+
+* Avoid manual population of dialogs
+* Whenever you need dynamic ETL
+* Supports data streaming
+
+Example use cases include:
+* Load 50 different file formats into a database with one pipeline template
+* Automatically normalize and load property sets
+
+== Performance basics
+
+Here are a few things to consider when looking at performance in a pipeline:
+
+* Pipelines are networks. The speed of the network is limited by the slowest
transform in it.
+* Slow transforms are indicated when running in Hop GUI. You'll see a dotted
line around the slow transforms.
+* Adding more copies and increasing parallelism is not always beneficial, but
it can be. The proof of that pudding is in the eating so take note of
execution times and see if you should increase or decrease parallelism to help
performance.
+
+== Governance
+
+Here are some self-evident pieces of advice:
+
+* Version control your project folder. Please consider using git.
+* Reference cases in commits
+* Make sure to have backups
+* Run continuous integration
+* Set up lifecycle environments (development, test, acceptance, production) as
needed
+* Test your pipelines with unit tests
+* Run all unit tests regularly
+* Validate the results & take action if needed
+
+== Loops
+
+Avoid looping in workflows. The easiest way to loop over a set of values,
rows, files, ... is to use an Executor transform.
+
+* xref:pipeline/transforms/pipelineexcecutor.adoc[Pipeline Executor] : run a
pipeline for each input row
+* xref:pipeline/transforms/workflowexecutor.adoc[Workflow Executor] : run a
workflow for each input row
+
+You can nicely map field values to parameters of the pipeline or workflow,
making loops a breeze.
+
+
+
diff --git
a/docs/hop-user-manual/modules/ROOT/pages/getting-started/hop-next-steps.adoc
b/docs/hop-user-manual/modules/ROOT/pages/getting-started/hop-next-steps.adoc
index eb58f8a..476ed74 100644
---
a/docs/hop-user-manual/modules/ROOT/pages/getting-started/hop-next-steps.adoc
+++
b/docs/hop-user-manual/modules/ROOT/pages/getting-started/hop-next-steps.adoc
@@ -11,6 +11,7 @@ There's a lot more to discover in Hop. Here are a couple of
topics you may want
* xref:pipeline/pipelines.adoc[Pipelines] takes a closer look at the various
aspects of creating and running pipelines, and contains the entire list of
transforms that are at your disposal
* xref:workflow/workflows.adoc[Workflows] takes a closer look at the various
aspects of creating and running workflows, and contains the entire list of
actions that are at your disposal
-* xref:projects/projects.adoc[Projects] explains how to work with projects and
environments
+* xref:best-practices/best-practices.adoc[Best Practices] covers a number of
things you might want to think about while using Apache Hop.
+* xref:projects/index.adoc[Projects] explains how to work with projects and
environments
* xref:vfs.adoc[VFS] explains how you can access resources in the 3 main cloud
platforms: AWS, Azure and GCP.
* xref:logging/logging-basics.adoc[Logging] explains how to configure Hop for
your desired log level and target platform