Re: [PR] Update pypi documentation 30145 [beam]

via GitHub Sat, 10 May 2025 05:30:49 -0700


liferoad commented on code in PR #34329:
URL: https://github.com/apache/beam/pull/34329#discussion_r2083131909



##########
sdks/python/README.md:
##########
@@ -0,0 +1,135 @@
+# Apache Beam
+
+[Apache Beam](http://beam.apache.org/) is a unified model for defining both 
batch and streaming data-parallel processing pipelines, as well as a set of 
language-specific SDKs for constructing pipelines and Runners for executing 
them on distributed processing backends, including [Apache 
Flink](http://flink.apache.org/), [Apache Spark](http://spark.apache.org/), 
[Google Cloud Dataflow](http://cloud.google.com/dataflow/), and [Hazelcast 
Jet](https://jet.hazelcast.org/).
+
+
+## Overview
+
+Beam provides a general approach to expressing [embarrassingly 
parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) data 
processing pipelines and supports three categories of users, each of which has 
relatively disparate backgrounds and needs.
+
+1. _End Users_: Writing pipelines with an existing SDK, running it on an 
existing runner. These users want to focus on writing their application logic 
and have everything else just work.
+2. _SDK Writers_: Developing a Beam SDK targeted at a specific user community 
(Java, Python, Scala, Go, R, graphical, etc). These users are language geeks 
and would prefer to be shielded from all the details of various runners and 
their implementations.
+3. _Runner Writers_: Have an execution environment for distributed processing 
and would like to support programs written against the Beam Model. Would prefer 
to be shielded from details of multiple SDKs.
+
+
+### The Beam Model
+
+The model behind Beam evolved from several internal Google data processing 
projects, including 
[MapReduce](http://research.google.com/archive/mapreduce.html), 
[FlumeJava](http://research.google.com/pubs/pub35650.html), and 
[MillWheel](http://research.google.com/pubs/pub41378.html). This model was 
originally known as the “[Dataflow 
Model](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)”.
+
+To learn more about the Beam Model (though still under the original name of 
Dataflow), see the World Beyond Batch: [Streaming 
101](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101) and 
[Streaming 
102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102) posts 
on O’Reilly’s Radar site, and the [VLDB 2015 
paper](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf).
+
+The key concepts in the Beam programming model are:
+
+* `PCollection`: represents a collection of data, which could be bounded or 
unbounded in size.
+* `PTransform`: represents a computation that transforms input PCollections 
into output PCollections.
+* `Pipeline`: manages a directed acyclic graph of PTransforms and PCollections 
that is ready for execution.
+* `PipelineRunner`: specifies where and how the pipeline should execute.
+
+### Runners
+
+Beam supports executing programs on multiple distributed processing backends 
through PipelineRunners. Currently, the following PipelineRunners are available:
+
+- The `DirectRunner` runs the pipeline on your local machine.
+- The `PrismRunner` runs the pipeline on your local machine using Beam 
Portability.
+- The `DataflowRunner` submits the pipeline to [Google Cloud 
Dataflow](http://cloud.google.com/dataflow/).
+- The `FlinkRunner` runs the pipeline on an Apache Flink cluster. The code has 
been donated from 
[dataArtisans/flink-dataflow](https://github.com/dataArtisans/flink-dataflow) 
and is now part of Beam.
+- The `SparkRunner` runs the pipeline on an Apache Spark cluster.
+- The `JetRunner` runs the pipeline on a Hazelcast Jet cluster. The code has 
been donated from 
[hazelcast/hazelcast-jet](https://github.com/hazelcast/hazelcast-jet) and is 
now part of Beam.
+- The `Twister2Runner` runs the pipeline on a Twister2 cluster. The code has 
been donated from [DSC-SPIDAL/twister2](https://github.com/DSC-SPIDAL/twister2) 
and is now part of Beam.
+
+Have ideas for new Runners? See the [runner-ideas 
label](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Arunner-ideas).
+
+
+## Get started with the Python SDK
+
+Get started with the [Beam Python SDK quickstart](/get-started/quickstart-py) 
to set up your Python development environment, get the Beam SDK for Python, and 
run an example pipeline. Then, read through the [Beam programming 
guide](/documentation/programming-guide) to learn the basic concepts that apply 
to all SDKs in Beam.
+
+See the [Python API reference](https://beam.apache.org/releases/pydoc/) for 
more information on individual APIs.
+
+## Python streaming pipelines
+
+Python [streaming pipeline execution](/documentation/sdks/python-streaming)

Review Comment:
   I think we should use full URLs here instead of site-relative paths.
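
   For illustration only, here is a rough sketch of how the site-relative links in this README could be expanded. It assumes the links are meant to resolve against `https://beam.apache.org/` (the host already used for the API reference link above). The helper name `absolutize_links` is hypothetical and not part of Beam.

```python
import re
from urllib.parse import urljoin

# Assumption: site-relative links should resolve against the Beam website root.
BASE_URL = "https://beam.apache.org/"

# Matches Markdown inline link targets that start with a single "/"
# (protocol-relative "//host/..." targets are deliberately skipped).
_RELATIVE_LINK = re.compile(r"\]\((/(?!/)[^)\s]*)\)")


def absolutize_links(markdown: str, base_url: str = BASE_URL) -> str:
    """Rewrite site-relative Markdown link targets as full URLs."""
    return _RELATIVE_LINK.sub(
        lambda m: "](" + urljoin(base_url, m.group(1)) + ")", markdown
    )


if __name__ == "__main__":
    line = "Python [streaming pipeline execution](/documentation/sdks/python-streaming)"
    print(absolutize_links(line))
```

   Links that already carry a scheme and host are left untouched, so running this over the whole file should only change targets such as `/get-started/quickstart-py` and `/documentation/programming-guide`.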



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@beam.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
