olehborysevych commented on code in PR #29507: URL: https://github.com/apache/beam/pull/29507#discussion_r1411130755
########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes it easier to create complex, modular I/O connectors in Beam. +When you apply a splittable DoFn to an element, the runner has the option of splitting the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. + +At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element. + +Executing an [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) follows the following steps: +1. Each element is paired with a restriction (e.g. filename is paired with offset range representing the whole file). +2. Each element and restriction pair is split (e.g. offset ranges are broken up into smaller pieces). +3. The runner redistributes the element and restriction pairs to several workers. +4. Element and restriction pairs are processed in parallel (e.g. the file is read). Within this last step, the element and restriction pair can pause its own processing and/or be split into further element and restriction pairs. + +See Tour of Beam [Splittable DoFn module](https://tour.beam.apache.org/tour/python/splittable-dofn/splittable) for practical example. Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
