sryza commented on code in PR #53346:
URL: https://github.com/apache/spark/pull/53346#discussion_r2603509085
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -90,118 +94,136 @@ name: my_pipeline
libraries:
  - glob:
      include: transformations/**
+storage: file:///absolute/path/to/storage/dir
catalog: my_catalog
database: my_db
configuration:
  spark.sql.shuffle.partitions: "1000"
```
-It's conventional to name pipeline spec files `spark-pipeline.yml`.
-
The `spark-pipelines init` command, described below, makes it easy to generate a pipeline project with default configuration and directory structure.
-
## The `spark-pipelines` Command Line Interface
-The `spark-pipelines` command line interface (CLI) is the primary way to execute a pipeline. It also contains an `init` subcommand for generating a pipeline project and a `dry-run` subcommand for validating a pipeline.
+The `spark-pipelines` command line interface (CLI) is the primary way to manage a pipeline.
`spark-pipelines` is built on top of `spark-submit`, meaning that it supports all cluster managers supported by `spark-submit`. It supports all `spark-submit` arguments except for `--class`.
### `spark-pipelines init`
-`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named "my_pipeline", including a spec file and example definitions.
+`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named `my_pipeline`, including a spec file and example transformation definitions.
### `spark-pipelines run`
-`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes. The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for a file named `spark-pipeline.yml` or `spark-pipeline.yaml`.
+`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes.
+
+The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for one of the files:
+
+* `spark-pipeline.yml`
+* `spark-pipeline.yaml`
### `spark-pipelines dry-run`
`spark-pipelines dry-run` launches an execution of a pipeline that doesn't write or read any data, but catches many kinds of errors that would be caught if the pipeline were to actually run. E.g.
- Syntax errors – e.g. invalid Python or SQL code
-- Analysis errors – e.g. selecting from a table that doesn't exist or selecting a column that doesn't exist
+- Analysis errors – e.g. selecting from a table or a column that doesn't exist
- Graph validation errors - e.g. cyclic dependencies
## Programming with SDP in Python
-SDP Python functions are defined in the `pyspark.pipelines` module. Your pipelines implemented with the Python API must import this module. It's common to alias the module to `dp` to limit the number of characters you need to type when using its APIs.
+SDP Python definitions are defined in the `pyspark.pipelines` module.
+
+Your pipelines implemented with the Python API must import this module. It's recommended to alias the module to `dp`.
```python
from pyspark import pipelines as dp
```
-### Creating a Materialized View with Python
+### Creating Materialized View
-The `@dp.materialized_view` decorator tells SDP to create a materialized view based on the results returned by a function that performs a batch read:
+The `@dp.materialized_view` decorator tells SDP to create a materialized view based on the results of a function that performs a batch read:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.materialized_view
-def basic_mv():
+def basic_mv() -> DataFrame:
    return spark.table("samples.nyctaxi.trips")
```
-Optionally, you can specify the table name using the `name` argument:
+The name of the materialized view is derived from the name of the function.
+
+You can specify the name of the materialized view using the `name` argument:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.materialized_view(name="trips_mv")
-def basic_mv():
+def basic_mv() -> DataFrame:
    return spark.table("samples.nyctaxi.trips")
```
-### Creating a Temporary View with Python
+### Creating Temporary View
-The `@dp.temporary_view` decorator tells SDP to create a temporary view based on the results returned by a function that performs a batch read:
+The `@dp.temporary_view` decorator tells SDP to create a temporary view based on the results of a function that performs a batch read:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.temporary_view
-def basic_tv():
+def basic_tv() -> DataFrame:
    return spark.table("samples.nyctaxi.trips")
```
This temporary view can be read by other queries within the pipeline, but can't be read outside the scope of the pipeline.
-### Creating a Streaming Table with Python
+### Creating Streaming Table
-Similarly, you can create a streaming table by using the `@dp.table` decorator with a function that performs a streaming read:
+You can create a streaming table using the `@dp.table` decorator with a function that performs a streaming read:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.table
-def basic_st():
+def basic_st() -> DataFrame:
    return spark.readStream.table("samples.nyctaxi.trips")
```
-### Loading Data from a Streaming Source
+### Loading Data from Streaming Source
Review Comment:
I think "from a Streaming Source" is more grammatically correct.
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -90,118 +94,136 @@ name: my_pipeline
libraries:
  - glob:
      include: transformations/**
+storage: file:///absolute/path/to/storage/dir
catalog: my_catalog
database: my_db
configuration:
  spark.sql.shuffle.partitions: "1000"
```
-It's conventional to name pipeline spec files `spark-pipeline.yml`.
-
The `spark-pipelines init` command, described below, makes it easy to generate a pipeline project with default configuration and directory structure.
-
## The `spark-pipelines` Command Line Interface
-The `spark-pipelines` command line interface (CLI) is the primary way to execute a pipeline. It also contains an `init` subcommand for generating a pipeline project and a `dry-run` subcommand for validating a pipeline.
+The `spark-pipelines` command line interface (CLI) is the primary way to manage a pipeline.
`spark-pipelines` is built on top of `spark-submit`, meaning that it supports all cluster managers supported by `spark-submit`. It supports all `spark-submit` arguments except for `--class`.
### `spark-pipelines init`
-`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named "my_pipeline", including a spec file and example definitions.
+`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named `my_pipeline`, including a spec file and example transformation definitions.
### `spark-pipelines run`
-`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes. The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for a file named `spark-pipeline.yml` or `spark-pipeline.yaml`.
+`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes.
+
+The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for one of the files:
+
+* `spark-pipeline.yml`
+* `spark-pipeline.yaml`
### `spark-pipelines dry-run`
`spark-pipelines dry-run` launches an execution of a pipeline that doesn't write or read any data, but catches many kinds of errors that would be caught if the pipeline were to actually run. E.g.
- Syntax errors – e.g. invalid Python or SQL code
-- Analysis errors – e.g. selecting from a table that doesn't exist or selecting a column that doesn't exist
+- Analysis errors – e.g. selecting from a table or a column that doesn't exist
- Graph validation errors - e.g. cyclic dependencies
## Programming with SDP in Python
-SDP Python functions are defined in the `pyspark.pipelines` module. Your pipelines implemented with the Python API must import this module. It's common to alias the module to `dp` to limit the number of characters you need to type when using its APIs.
+SDP Python definitions are defined in the `pyspark.pipelines` module.
+
+Your pipelines implemented with the Python API must import this module. It's recommended to alias the module to `dp`.
```python
from pyspark import pipelines as dp
```
-### Creating a Materialized View with Python
Review Comment:
I would prefer to keep the prior heading names – yes, "Python" is redundant with the outer heading, but it makes it easier for someone scrolling through the doc to find what they're looking for. It might also help with SEO.
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -409,12 +454,15 @@ SELECT * FROM STREAM(customers_us_east);
### Python Considerations
- SDP evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view.
-- The function used to define a dataset must return a Spark DataFrame.
+- The function used to define a dataset must return a Spark `pyspark.sql.DataFrame`.
Review Comment:
```suggestion
- The function used to define a dataset must return a `pyspark.sql.DataFrame`.
```
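   Side note for anyone reading along: a minimal sketch of what this bullet means in practice, reusing the table and decorator from the examples earlier in the guide. The function name and the `limit` are made up, and `spark` is assumed to be available in definition files the same way the guide's other snippets use it:

   ```python
   from pyspark import pipelines as dp
   from pyspark.sql import DataFrame

   @dp.materialized_view
   def trips_sample() -> DataFrame:
       # Build and return the DataFrame; SDP materializes the result itself,
       # so no .write()/.saveAsTable() calls belong in the function body.
       return spark.table("samples.nyctaxi.trips").limit(1000)
   ```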
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -409,12 +454,15 @@ SELECT * FROM STREAM(customers_us_east);
### Python Considerations
- SDP evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view.
-- The function used to define a dataset must return a Spark DataFrame.
+- The function used to define a dataset must return a Spark `pyspark.sql.DataFrame`.
- Never use methods that save or write to files or tables as part of your SDP dataset code.
+- When using the `for` loop pattern to define datasets in Python, ensure that the list of values passed to the `for` loop is always additive.
Review Comment:
I think it might be clearer to just leave this bullet out – we don't discuss a `for` loop pattern above.
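   For context, the `for` loop pattern that sentence seems to refer to is something like the rough sketch below; the `regions` list, the view names, and the table names are made up for illustration, not taken from the guide:

   ```python
   from pyspark import pipelines as dp
   from pyspark.sql import DataFrame

   # Hypothetical list of suffixes; the bullet's "additive" advice presumably
   # means entries should only ever be appended, never removed or renamed.
   regions = ["us_east", "us_west", "eu_west"]

   for region in regions:
       @dp.materialized_view(name=f"trips_{region}")
       def trips_for_region(region: str = region) -> DataFrame:
           # Bind the loop variable through a default argument so each view
           # reads its own region rather than the loop's final value.
           return spark.table(f"samples.nyctaxi.trips_{region}")
   ```

   If we do want to keep the bullet, a snippet along these lines above it would give it the missing context.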