incubator-gobblin git commit: [GOBBLIN-598] Add documentation on split enabled distcp (config glossary & gobblin distcp page)

hutran Thu, 20 Dec 2018 09:58:20 -0800

Repository: incubator-gobblin
Updated Branches:
  refs/heads/master f8d791b6a -> 66b1fcda9



[GOBBLIN-598] Add documentation on split enabled distcp (config glossary & 
gobblin distcp page)

Closes #2526 from cshen98/split-distcp-docs


Project: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/commit/66b1fcda
Tree: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/tree/66b1fcda
Diff: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/diff/66b1fcda

Branch: refs/heads/master
Commit: 66b1fcda9b7a113ab6f17af1726704df05368fd7
Parents: f8d791b
Author: Carl Shen <[email protected]>
Authored: Thu Dec 20 09:57:38 2018 -0800
Committer: Hung Tran <[email protected]>
Committed: Thu Dec 20 09:57:38 2018 -0800

----------------------------------------------------------------------
 gobblin-docs/adaptors/Gobblin-Distcp.md         |  8 ++-
 .../Configuration-Properties-Glossary.md        | 71 ++++++++++++++++++++
 2 files changed, 78 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-gobblin/blob/66b1fcda/gobblin-docs/adaptors/Gobblin-Distcp.md
----------------------------------------------------------------------
diff --git a/gobblin-docs/adaptors/Gobblin-Distcp.md 
b/gobblin-docs/adaptors/Gobblin-Distcp.md
index 1cda225..e6e38df 100644
--- a/gobblin-docs/adaptors/Gobblin-Distcp.md
+++ b/gobblin-docs/adaptors/Gobblin-Distcp.md
@@ -246,6 +246,12 @@ public interface DatasetsFinder<T extends Dataset> {
     * If the publish steps fail, the write-ahead log is preserved, and Gobblin 
will attempt to run them on the next execution. Relevant directories will not 
be deleted on exit.
 * Eventually, write step should also use exactly-once feature.
 
+## Splitting files into block level granularity work units
+
+Gobblin Distcp can optionally split files into work units of block level 
granularity. This relies on a helper class, `DistcpFileSplitter`, which 
provides methods for:
+* Splitting of files into block level work units, which is done at the 
`CopySource`; the block level granularity is represented by an additional 
`Split` construct within each work unit that contains offset and ordering 
information.
+* Merging of block level work units/splits, which is done at the 
`CopyDataPublisher`; this uses calls to the `FileSystem#concat` API to append 
the separately copied entities of each file back together.
+
 # Leverage
 
 Gobblin Distcp leverages Gobblin as its running framework, and most features 
available to Gobblin:
@@ -262,7 +268,7 @@ Gobblin Distcp leverages Gobblin as its running framework, 
and most features ava
 There are two components in the flow:
 
 * File listing and work unit generation: slow if there are too many files. 
Dataset aware optimizations are possible, as well as using services other than 
Hadoop ls call (like lsr or HDFS edit log), so this can be improved and should 
scale with the correct optimizations. Work unit generation is currently a 
serial process handled by Gobblin and could be a bottleneck. If we find it is a 
bottleneck, that process is parallelizable.
-* Actual copy tasks: massively parallel using MR or many containers in YARN. 
Generally, it is the most expensive part of the flow. Although inputs can be 
split, HDFS does not support parallel writing to the same file, so large files 
will be a bottleneck (but this is true with distcp2 as well). This issue will 
be alleviated with the YARN executing model, where WorkUnits are allocated 
dynamically to containers (multiple small files can be copied in one container 
while another container copies a large file), and datasets can be published as 
soon as they are ready (remove impact from slow datasets). In direct byte 
copies, we have observed speeds that saturate the available network speed. Byte 
level transformations slow down the process (e.g. decrypting).
+* Actual copy tasks: massively parallel using MR or many containers in YARN. 
Generally, it is the most expensive part of the flow. Although inputs can be 
split, HDFS does not support parallel writing to the same file, so large files 
will be a bottleneck (but this is true with distcp2 as well). This issue will 
be alleviated with the YARN executing model, where WorkUnits are allocated 
dynamically to containers (multiple small files can be copied in one container 
while another container copies a large file), and datasets can be published as 
soon as they are ready (remove impact from slow datasets). If this is an issue 
for a job in MR/with HDFS, Gobblin Distcp provides an option to enable 
splitting of files into block level granularity work units to be copied 
independently, then merged back together before publishing, which may help to 
reduce the mapper skew and alleviate the bottleneck. In direct byte copies, we 
have observed speeds that saturate the available network speed. Byte level 
transformations (e.g. decrypting) slow down the process, and also cannot be 
used with jobs that enable splitting.
 
 # Monitoring and Alerting
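
To make the split and merge steps described in the new "Splitting files into block level granularity work units" section more concrete, here is a minimal, self-contained Java sketch. It is not Gobblin's `DistcpFileSplitter`: the `BlockSplitSketch` class, its `Split` and `Part` records, and their field names are hypothetical, and the merge is simulated on in-memory byte arrays, whereas the documentation above says the real merge goes through `FileSystem#concat`.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class BlockSplitSketch {
    // Hypothetical stand-in for the per-work-unit split: a byte range plus ordering information.
    public record Split(long lowPosition, long highPosition, int splitNumber, int totalSplits) {}

    // Hypothetical pairing of a split with the bytes that were copied for it.
    public record Part(Split split, byte[] data) {}

    // Divide a file of fileLength bytes into consecutive ranges of at most splitSize bytes.
    public static List<Split> split(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        int total = (int) Math.max(1, (fileLength + splitSize - 1) / splitSize);
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            long low = i * splitSize;
            long high = Math.min(low + splitSize, fileLength);
            splits.add(new Split(low, high, i, total));
        }
        return splits;
    }

    // Reassemble independently copied parts by ordering them on splitNumber and appending them.
    public static byte[] merge(List<Part> parts) {
        List<Part> ordered = new ArrayList<>(parts);
        ordered.sort(Comparator.comparingInt((Part p) -> p.split().splitNumber()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Part p : ordered) {
            out.write(p.data(), 0, p.data().length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] file = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        List<Part> parts = new ArrayList<>();
        for (Split s : split(file.length, 4)) {
            byte[] chunk = Arrays.copyOfRange(file, (int) s.lowPosition(), (int) s.highPosition());
            parts.add(new Part(s, chunk));
        }
        Collections.shuffle(parts); // copies may finish in any order
        System.out.println(Arrays.equals(file, merge(parts)));
    }
}
```

The ordering information carried by each split is what allows the parts to be copied in any order and still be appended back together correctly.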
 

http://git-wip-us.apache.org/repos/asf/incubator-gobblin/blob/66b1fcda/gobblin-docs/user-guide/Configuration-Properties-Glossary.md
----------------------------------------------------------------------
diff --git a/gobblin-docs/user-guide/Configuration-Properties-Glossary.md 
b/gobblin-docs/user-guide/Configuration-Properties-Glossary.md
index 988991b..46b3651 100644
--- a/gobblin-docs/user-guide/Configuration-Properties-Glossary.md
+++ b/gobblin-docs/user-guide/Configuration-Properties-Glossary.md
@@ -22,6 +22,10 @@ Gobblin also allows you to specify a global configuration 
file that contains com
 * [Email Alert Properties](#Email-Alert-Properties)  
 * [Source Properties](#Source-Properties)  
   * [Common Source Properties](#Common-Source-Properties)  
+  * [Distcp CopySource Properties](#Distcp-CopySource-Properties)
+    * [RecursiveCopyableDataset 
Properties](#RecursiveCopyableDataset-Properties)
+    * [WorkUnitBinPacker Properties](#WorkUnitBinPacker-Properties)
+    * [DistcpFileSplitter Properties](#DistcpFileSplitter-Properties)
   * [QueryBasedExtractor Properties](#QueryBasedExtractor-Properties) 
     * [JdbcExtractor Properties](#JdbcExtractor-Properties)  
   * [FileBasedExtractor Properties](#FileBasedExtractor-Properties)  
@@ -592,6 +596,73 @@ No
 
 ###### Required
 
+## Distcp CopySource Properties <a name="Distcp-CopySource-Properties"></a>
+#### gobblin.copy.simulate
+###### Description
+Performs the copy file listing but does not execute the actual copy.
+###### Default Value
+False
+###### Required
+No
+#### gobblin.copy.includeEmptyDirectories
+###### Description
+Whether to include empty directories from the source in the copy.
+###### Default Value
+False
+###### Required
+No
+### RecursiveCopyableDataset Properties <a 
name="RecursiveCopyableDataset-Properties"></a>
+#### gobblin.copy.recursive.deleteEmptyDirectories
+###### Description
+Whether to delete newly empty directories found, up to the dataset root.
+###### Default Value
+False
+###### Required
+No
+#### gobblin.copy.recursive.delete
+###### Description
+Whether to delete files in the target that don't exist in the source.
+###### Default Value
+False
+###### Required
+No
+#### gobblin.copy.recursive.update
+###### Description
+Will update files that are different between the source and target, and skip 
files already in the target.
+###### Default Value
+False
+###### Required
+No
+### DistcpFileSplitter Properties <a name="DistcpFileSplitter-Properties"></a>
+#### gobblin.copy.split.enabled
+###### Description
+Splits files into block level granularity work units, which can be copied 
independently, then merged back together before publishing. For splitting to 
actually take effect, `gobblin.copy.file.max.split.size` must also be set.
+###### Default Value
+False
+###### Required
+No
+#### gobblin.copy.file.max.split.size
+###### Description
+If splitting is enabled, the split size (in bytes) for the block level work 
units is calculated based on rounding down the value of this property to the 
nearest integer multiple of the block size. If the value of this property is 
less than the block size, it gets adjusted up.
+###### Default Value
+Long.MAX_VALUE
+###### Required
+No
+### WorkUnitBinPacker Properties <a name="WorkUnitBinPacker-Properties"></a>
+#### gobblin.copy.binPacking.maxSizePerBin
+###### Description
+Limits the maximum weight that can be packed into a multi work unit produced 
from bin packing. A value of 0 means packing is not done.
+###### Default Value
+0
+###### Required
+No
+#### gobblin.copy.binPacking.maxWorkUnitsPerBin
+###### Description
+Limits the maximum number of work units that can be packed into a multi 
work unit produced from bin packing.
+###### Default Value
+50
+###### Required
+No
 ## QueryBasedExtractor Properties <a name="QueryBasedExtractor-Properties"></a>
 The following table lists the query based extractor configuration properties.
 #### source.querybased.watermark.type
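
As an illustration of the rounding rule documented above for `gobblin.copy.file.max.split.size`, the following Java sketch computes an effective split size from a configured maximum and a block size. The class and method names are hypothetical and this is not Gobblin's code. The glossary entry only says that a value below the block size "gets adjusted up"; the sketch assumes it is adjusted up to exactly one block.

```java
public class SplitSizeSketch {
    // Round maxSplitSize down to the nearest integer multiple of blockSize.
    // Assumption: a value smaller than one block is raised to exactly one block.
    public static long effectiveSplitSize(long maxSplitSize, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        if (maxSplitSize < blockSize) {
            return blockSize;
        }
        return (maxSplitSize / blockSize) * blockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(effectiveSplitSize(300 * mb, 128 * mb) / mb); // prints 256
        System.out.println(effectiveSplitSize(64 * mb, 128 * mb) / mb);  // prints 128
    }
}
```

With the default of `Long.MAX_VALUE`, the rounded value stays far larger than any file, which is consistent with the note that the max split size must be set for splitting to take effect.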
