melap commented on a change in pull request #15778:
URL: https://github.com/apache/beam/pull/15778#discussion_r745222397
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following
pages:
* [Choosing a Runner](/documentation/#choosing-a-runner)
* [Beam Capability Matrix](/documentation/runners/capability-matrix/)
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial
window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and
`Combine`,
Review comment:
š
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -42,6 +42,14 @@ understand an important set of core concepts:
them to a runner.
* [_Runner_](#runner) - A runner runs a Beam pipeline using the capabilities
of
your chosen data processing engine.
+ * [_Window_](#window) - A `PCollection` can be subdivided into windows based
on
Review comment:
Yeah, I've tried to keep things very simple here, and let the
programming guide content cover the gory details. I'll leave it for now unless
there are objections, and we can always tweak it later if it's causing
confusion.
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following
pages:
* [Choosing a Runner](/documentation/#choosing-a-runner)
* [Beam Capability Matrix](/documentation/runners/capability-matrix/)
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial
window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and
`Combine`,
+work implicitly on a per-window basis; they process each `PCollection` as a
+succession of multiple, finite windows, though the entire collection itself may
+be of unbounded size.
+
+Beam provides several windowing functions:
+
+ * **Fixed time windows** (also known as "tumbling windows") represent a
consistent
+ duration, non overlapping time interval in the data stream.
+ * **Sliding time windows** (also known as "hopping windows") also represent
time
+ intervals in the data stream; however, sliding time windows can overlap.
+ * **Per-session windows** define windows that contain elements that are
within a
+ certain gap duration of another element.
+ * **Single global window**: by default, all data in a `PCollection` is
assigned to
+ the single global window, and late data is discarded.
+ * **Calendar-based windows** (not supported by the Beam SDK for Python)
+
+You can also define your own windowing function if you have more complex
+requirements.
+
+For more information about windows, see the following page:
+
+ * [Beam Programming Guide:
Windowing](/documentation/programming-guide/#windowing)
+
+### Watermark
+
+In any data processing system, there is a certain amount of lag between the
time
+a data event occurs (the "event time", determined by the timestamp on the data
+element itself) and the time the actual data element gets processed at any
stage
+in your pipeline (the "processing time", determined by the clock on the system
+processing the element). In addition, there are no guarantees that data events
Review comment:
Thanks, I rearranged this a bit. Added your example suggestions, moved
the windowing example up to the windowing section (not sure how it got down
here), and moved the "isn't always guaranteed" sentence up to the intro
paragraph.
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following
pages:
* [Choosing a Runner](/documentation/#choosing-a-runner)
* [Beam Capability Matrix](/documentation/runners/capability-matrix/)
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial
window,
Review comment:
This brings up a general question I've been looking at regarding
elements in multiple windows. The docs seem to have (on the surface at least)
contradictory statements on how many windows an element can be in.
From the existing section in
https://beam.apache.org/documentation/basics/#windowed-elements :
**No element resides in multiple windows**; two elements can be equal except
for their window, but they are not the same.
From
https://beam.apache.org/documentation/programming-guide/#windowing-basics :
Each element in a PCollection is **assigned to one or more windows**
according to the PCollection's windowing function
From
https://beam.apache.org/documentation/programming-guide/#sliding-time-windows :
Because multiple windows overlap, most elements in a data set will belong to
**more than one window**.
This suggestion is another in the "one or more" column.
However, I've also heard that an element that falls into two different
windows is actually considered two separate elements.
What's the most accurate explanation here? Do you have a suggestion as to
which way to document this?
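To make the question concrete, here is a minimal sketch in plain Python (not the Beam API) of how the two readings could fit together. The function names `sliding_windows_for` and `explode`, and the integer-second timestamps, are assumptions made up for illustration. The idea, as I understand it: the windowing function can assign one input timestamp to several overlapping windows, and each (value, window) pair is then treated as its own element.

```python
def sliding_windows_for(timestamp, size, period):
    """Return every [start, end) sliding window that contains `timestamp`.

    Hypothetical helper: `size` is the window length and `period` is how
    often a new window starts, both in the same units as `timestamp`.
    """
    # Latest window start that is <= timestamp.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def explode(value, timestamp, size, period):
    """Produce one (value, timestamp, window) record per assigned window."""
    return [
        (value, timestamp, window)
        for window in sliding_windows_for(timestamp, size, period)
    ]


if __name__ == "__main__":
    # 60-second windows starting every 30 seconds: timestamp 65 lands in two.
    for record in explode("click", 65, size=60, period=30):
        print(record)
```

Under this sketch, "assigned to one or more windows" describes the input to window assignment, while "no element resides in multiple windows" describes the output, where each record carries exactly one window. With `size == period` the same code degenerates to fixed windows and yields a single record.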