[hop] branch master updated: HOP-4125 initial version of joins and lookups page

hansva Mon, 15 Aug 2022 09:50:38 -0700

This is an automated email from the ASF dual-hosted git repository.

hansva pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hop.git



The following commit(s) were added to refs/heads/master by this push:
     new 8d187df922 HOP-4125 initial version of joins and lookups page
     new cdbe21cc33 Merge pull request #1632 from bamaer/HOP-4125
8d187df922 is described below

commit 8d187df922cd0d8e99ea9e0cafacfb99f3399a1c
Author: Bart Maertens <[email protected]>
AuthorDate: Sat Aug 13 13:36:26 2022 +0200

    HOP-4125 initial version of joins and lookups page
---
 docs/hop-user-manual/modules/ROOT/nav.adoc         |   1 +
 .../ROOT/pages/how-to-guides/joins-lookups.adoc    | 160 +++++++++++++++++++++
 2 files changed, 161 insertions(+)

diff --git a/docs/hop-user-manual/modules/ROOT/nav.adoc 
b/docs/hop-user-manual/modules/ROOT/nav.adoc
index 9894c25173..793eadbc4f 100644
--- a/docs/hop-user-manual/modules/ROOT/nav.adoc
+++ b/docs/hop-user-manual/modules/ROOT/nav.adoc
@@ -443,4 +443,5 @@ under the License.
 * xref:hop-usps.adoc[Unique Selling Propositions]
 * xref:how-to-guides/index.adoc[How-to guides]
 ** xref:how-to-guides/apache-hop-web-services-docker.adoc[Hop web services in 
Docker]
+** xref:how-to-guides/joins-lookups.adoc[Joins and lookups]
 ** xref:how-to-guides/run-hop-in-apache-airflow.adoc[Run Hop workflows and 
pipelines in Apache Airflow]
\ No newline at end of file
diff --git 
a/docs/hop-user-manual/modules/ROOT/pages/how-to-guides/joins-lookups.adoc 
b/docs/hop-user-manual/modules/ROOT/pages/how-to-guides/joins-lookups.adoc
new file mode 100644
index 0000000000..9108a1d293
--- /dev/null
+++ b/docs/hop-user-manual/modules/ROOT/pages/how-to-guides/joins-lookups.adoc
@@ -0,0 +1,160 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+////
+[[JoinsLookups]]
+:imagesdir: ../../assets/images
+:description: This guide provides an overview of the various join and lookup 
transforms and helps you to choose which transform makes most sense for your 
use case.
+
+= Joins and lookups
+
+Hop offers a lot of options to enrich and combine data, among other things 
through joins and lookups. This guide explains the differences between the 
various joins and lookups, and will help you to choose the right transform for 
your pipelines.
+
+=== Joins
+
+Joins are a way to merge data from two or more different streams in your 
pipeline. Joins typically require you to select a set of fields (the join keys) 
to join the streams on. The data in all streams that are involved in the join 
needs to be sorted on the join keys.
+
+Depending on the transform you use and the configuration options you specify, 
joins can change the amount of data you're processing in both way. For example, 
an inner `Merge join` with a limited number of matches between the keys in both 
streams can significantly reduce the number of rows in your stream. A `Join 
Rows (Cartesian Product)` on the other hand can result in an explosion of the 
number of rows you started the pipeline with.
+
+TIP: Multiple streams in your pipeline that have the exact same layout 
(identical fields, identical data types, identical order) can be joined or 
merged without the need for a transform. Just create a hop from the last 
transform in two or more substreams to a target transform. Hop Gui will 
complain if there are differences in the layout of the streams you're combining.
+
+The available join transforms are:
+
+|===
+|Transform|When to use
+|xref:pipeline/transforms/joinrows.adoc[**Join Rows (Cartesian Product)**]
+
+produce combinations (Cartesian product) of all rows in the input streams
+|situations where you want to create a combination of _all_ data in one stream 
with _all_ data in another
+
+**example**: you want to create a copy of a data set for all members in a 
team, a list of available months or similar.
+
+WARN: Use with care on large volumes of data. The combination of all data in 
both streams easily become huge.
+
+|xref:pipeline/transforms/mergejoin.adoc[**Merge Join**]
+
+classic merge join between data sets with data coming from two different input 
transforms
+|You want to combine data from two different streams based on a matching key 
in both streams, and want to continue working with (a selection of) the 
combined data from both streams.
+
+**example**: a subset of your data is available in files, another subset is 
available in a database table. Both the file have a matching key (e.g an ID) 
that you can use to combine both sets.
+
+|xref:pipeline/transforms/mergerows.adoc[**Merge Rows (diff)**]
+
+compares and merges data within two rows of data, adding a field indicator for 
each row
+|You have two sets of data with the same layout. You want to compare both data 
sets and identify which lines in one set a new, identical, changed or deleted 
compared to the other data set.
+
+**example***: you receive a daily file that needs to be processed. Compare 
today's file with yesterday's file, so you can ignore the identical data and 
only process the changes.
+
+|xref:pipeline/transforms/multimerge.adoc[**Multiway merge join**]
+
+joins input data from multiple streams
+|Join data from multiple (typically more than two) streams, similar to merge 
join. This transform provides a shortcut to what you otherwise would do by 
chaining multiple merge joins after each other.
+
+**example**: you receive data from files, a database query and a REST api. All 
three streams have one or more corresponding keys to join data from the various 
streams.
+
+|xref:pipeline/transforms/xmljoin.adoc[**XML Join**]
+
+adds xml tags from one stream into a leading XML structure from a second stream
+|You need to build XML files from various data sources and with multiple 
nested elements. This transform is atypical in the sense that you're performing 
joins to build output instead of the other join transforms that combine 
incoming data for further processing.
+
+**example**: you're building complex XML output files.
+
+|xref:pipeline/transforms/databasejoin.adoc[**Database join**]
+
+allows you to run a query against a database using data obtained from previous 
transforms
+|The database join executes a (parameterized) SQL query for every row you 
execute. While this transform gives you unparallelled flexibility, it performs 
a database query for every single row, so it is a potential performance 
bottleneck.
+
+**example**: You need to fetch information that can't be retrieved from a 
single table and requires a query that fetches data from multiple tables.
+
+|===
+
+=== Lookups
+
+Lookups are a way to let you enrich your pipeline's stream with additional 
fields from external sources. The transforms allow you to choose which 
additional fields will be retrieved from the lookup source.
+
+Contrary to joins, lookups work on the current pipeline stream and your never 
change the number of rows you're processing.
+
+There are a number of transforms in the `Lookup` category that are considered 
self-explanatory in a lookup context and won't be included here. These 
transforms are:
+
+* xref:pipeline/transforms/calldbproc.adoc[Call DB Procedure]
+* xref:pipeline/transforms/checkfilelocked.adoc[Check if file is locked]
+* xref:pipeline/transforms/columnexists.adoc[Column exists]
+* xref:pipeline/transforms/dynamicsqlrow.adoc[Dynamic SQL row]
+* xref:pipeline/transforms/fileexists.adoc[File exists]
+* xref:pipeline/transforms/fuzzymatch.adoc[Fuzzy match]
+* xref:pipeline/transforms/serverstatus.adoc[Get server status]
+* xref:pipeline/transforms/http.adoc[HTTP client]
+* xref:pipeline/transforms/httppost.adoc[HTTP post]
+* xref:pipeline/transforms/rest.adoc[REST client]
+* xref:pipeline/transforms/tableexists.adoc[Table exists]
+* xref:pipeline/transforms/webservices.adoc[Web services lookup]
+
+|===
+|Transform|When to use
+|xref:pipeline/transforms/combinationlookup.adoc[**Combination lookup/update**]
+
+allows you to store information in a junk-dimension table. The transform can 
possibly also be used to maintain Kimball pure Type 1 dimensions.
+
+|in a lookup context, this transform gets data from a 
https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_1:_overwrite[type 
1 slowly changing dimension^].
+
+|xref:pipeline/transforms/dimensionlookup.adoc[**Dimension lookup/update**]
+
+allows you to implement Ralph Kimball’s slowly changing dimension for both 
types: Type I (update) and Type II (insert) together with some additional 
functions.
+
+This transform can be used to populate a dimension table or to look up values 
in a slowly changing dimension without making any changes.
+
+|In a lookup context, this transform lets you perform lookups in a 
https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row[type
 2 slowly changing dimension^].
+
+In a lookup context, you typically use the natural keys and date or timestamp 
for your data to retrieve the surrogate key for the correct version of the data 
in a dimension.
+
+Uncheck the `Update the dimension` checkbox to use the transform as a lookup 
and to prevent it from updating the dimension table.
+
+**example**: you want to retrieve the correct surrogate key for a dimension 
record in a slowly changing customer table in your data warehouse.
+
+|xref:pipeline/transforms/databaselookup.adoc[**Database lookup**]
+
+allows you to look up values in a database table. Lookup values are added as 
new fields onto the stream.
+
+|You need to retrieve additional information from a single database table, 
based on fields in your pipeline stream.
+
+**example**: you're processing data that contains a customer's id. You also 
need the customer's name and location, which is available in a database table.
+
+|xref:pipeline/transforms/streamlookup.adoc[**Stream lookup**]
+
+allows you to look up data using information coming from other transforms in 
the pipeline. The data coming from the Source transform is first read into 
memory and is then used to look up data from the main stream.
+
+|use the Stream lookup when you need to perform a lot of lookups on a fairly 
limited set of data.
+
+The performance benefit of performing lookups from memory comes with the 
limitation of the available memory.
+
+The stream lookup accepts data from any pipeline stream, but since the data 
needs to be loaded into memory before the pipeline starts, consider your 
available memory resources the memory footprint for your Stream lookup's input 
before adding it to your pipeline. In the default Hop configuration with 2 GB 
of memory, a couple of megabytes is fine, a 20GB CSV file as Stream lookup 
input isn't.
+
+**example** you need to process a large dataset that contains product prices 
in a number of different currencies. You received a list of historical 
conversion rates for the last month in a csv file. This lookup file is about 
5000 lines, loading it into memory an picking up the right conversion rate with 
a `Stream lookup` transform allows you to perform lightning fast lookups.    
+
+|===
+
+
+=== Lookup Caching
+
+The xref:pipeline/transforms/combinationlookup.adoc[Combination 
lookup/update], xref:pipeline/transforms/dimensionlookup.adoc[Dimension 
lookup/update] and xref:pipeline/transforms/databaselookup.adoc[Database 
lookup] allow you to cache data. Cached data is stored in memory to speed up 
the lookup process.
+
+The dialog options vary slightly, but the caching options offered by these 
transforms are:
+
+* enable cache? (boolean): switches caching on or off
+* cache size: the size (number of rows) to cache in memory
+* pre-load cache: load data in the cache before the pipeline starts
+* load all data from table (database lookup only): load all the table's data 
to memory before the pipeline starts.
+
+Consider your available memory resources when configuring caching. Caching can 
significantly speed up the lookup process, but memory is a finite resource.

[hop] branch master updated: HOP-4125 initial version of joins and lookups page

Reply via email to