youngoli commented on a change in pull request #11517:
URL: https://github.com/apache/beam/pull/11517#discussion_r416170107



##########
File path: sdks/go/pkg/beam/pardo.go
##########
@@ -222,6 +222,81 @@ func ParDo0(s Scope, dofn interface{}, col PCollection, 
opts ...Option) {
 // DoFn instance via output PCollections, in the absence of external
 // communication mechanisms written by user code.
 //
+// Splittable DoFns
+//
+// Splittable DoFns are DoFns that are able to split work within an element,
+// as opposed to only at element boundaries like normal DoFns. This is useful
+// for DoFns that emit many outputs per input element and can distribute that
+// work among multiple workers. The most common examples of this are sources.
+//
+// In order to split work within an element, splittable DoFns use the concept 
of
+// restrictions, which are objects that are associated with an element and
+// describe a portion of work on that element. For example, a restriction
+// associated with a filename might describe what byte range within that file 
to
+// process. In addition to restrictions, splittable DoFns also rely on
+// restriction trackers to track progress and perform splits on a restriction
+// currently being processed. See the `RTracker` interface in core/sdf/sdf.go
+// for more details.
+//
+// Splitting
+//
+// Splitting means taking one restriction and splitting into two or more that
+// cover the entire input space of the original one. In other words, processing
+// all the split restrictions should produce identical output to processing
+// the original one.
+//
+// Splitting occurs in two stages. The initial splitting occurs before any
+// restrictions have started processing. This step is used to split large
+// restrictions into smaller ones that can then be distributed among multiple
+// workers for processing. Initial splitting is user-defined and optional.
+//
+// Dynamic splitting occurs during the processing of a restriction in runners
+// that have implemented it. If there are available workers, runners may split
+// the unprocessed portion of work from a busy worker and shard it to available
+// workers in order to better distribute work. With unsplittable DoFns this can
+// only occur on element boundaries, but for splittable DoFns this split
+// can land within a restriction and will require splitting that restriction.
+//
+// * Note: The Go SDK currently does not support dynamic splitting for SDFs,
+//   only initial splitting. Work can only be split at element boundaries.
+//
+// Splittable DoFn Methods
+//
+// Making a splittable DoFn requires the following methods to be implemented on
+// a DoFn in addition to the usual DoFn requirements. In the following
+// method signatures `elem` represents the main input elements to the DoFn, and
+// should match the types used in ProcessElement. `restriction` represents the
+// user-defined restriction, and can be any type as long as it is consistent
+// throughout all the splittable DoFn methods:
+//
+// * `CreateInitialRestriction(element) restriction`
+//     CreateInitialRestriction creates an initial restriction encompassing an
+//     entire element. The restriction created stays associated with the 
element
+//     it describes.
+// * `SplitRestriction(elem, restriction) []restriction`
+//     SplitRestriction takes an element and its initial restriction, and
+//     optionally performs an initial split on it, returning a slice of all the
+//     split restrictions. If no splits are desired, the method returns a slice
+//     containing only the original restriction. This method will always be
+//     called on each newly created restriction before they are processed.
+// * `RestrictionSize(elem, restriction) float64`
+//     RestrictionSize returns a cheap size estimation for a restriction. This
+//     size is an abstract scalar value that represents how much work a
+//     restriction takes compared to other restrictions in the same DoFn. For
+//     example, a size of 200 represents twice as much work as a size of
+//     100, but the numbers do not represent anything on their own. Size is
+//     used by runners to estimate work for liquid sharding.
+// * `CreateTracker(restriction) sdf.RTracker`

Review comment:
       Ah, good point. It's supposed to be a concrete type implementing the 
sdf.RTracker interface. I'll update this to mention that explicitly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to