sryza commented on code in PR #53346:
URL: https://github.com/apache/spark/pull/53346#discussion_r2603509085
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -90,118 +94,136 @@ name: my_pipeline
libraries:
  - glob:
      include: transformations/**
+storage: file:///absolute/path/to/storage/dir
catalog: my_catalog
database: my_db
configuration:
  spark.sql.shuffle.partitions: "1000"
```
-It's conventional to name pipeline spec files `spark-pipeline.yml`.
-
The `spark-pipelines init` command, described below, makes it easy to generate a pipeline project with default configuration and directory structure.
-
## The `spark-pipelines` Command Line Interface
-The `spark-pipelines` command line interface (CLI) is the primary way to execute a pipeline. It also contains an `init` subcommand for generating a pipeline project and a `dry-run` subcommand for validating a pipeline.
+The `spark-pipelines` command line interface (CLI) is the primary way to manage a pipeline.
`spark-pipelines` is built on top of `spark-submit`, meaning that it supports all cluster managers supported by `spark-submit`. It supports all `spark-submit` arguments except for `--class`.
### `spark-pipelines init`
-`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named "my_pipeline", including a spec file and example definitions.
+`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named `my_pipeline`, including a spec file and example transformation definitions.
### `spark-pipelines run`
-`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes. The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for a file named `spark-pipeline.yml` or `spark-pipeline.yaml`.
+`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes.
+
+The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for one of the files:
+
+* `spark-pipeline.yml`
+* `spark-pipeline.yaml`
### `spark-pipelines dry-run`
`spark-pipelines dry-run` launches an execution of a pipeline that doesn't write or read any data, but catches many kinds of errors that would be caught if the pipeline were to actually run. E.g.
- Syntax errors – e.g. invalid Python or SQL code
-- Analysis errors – e.g. selecting from a table that doesn't exist or selecting a column that doesn't exist
+- Analysis errors – e.g. selecting from a table or a column that doesn't exist
- Graph validation errors - e.g. cyclic dependencies
## Programming with SDP in Python
-SDP Python functions are defined in the `pyspark.pipelines` module. Your pipelines implemented with the Python API must import this module. It's common to alias the module to `dp` to limit the number of characters you need to type when using its APIs.
+SDP Python definitions are defined in the `pyspark.pipelines` module.
+
+Your pipelines implemented with the Python API must import this module. It's recommended to alias the module to `dp`.
```python
from pyspark import pipelines as dp
```
-### Creating a Materialized View with Python
+### Creating Materialized View
-The `@dp.materialized_view` decorator tells SDP to create a materialized view based on the results returned by a function that performs a batch read:
+The `@dp.materialized_view` decorator tells SDP to create a materialized view based on the results of a function that performs a batch read:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.materialized_view
-def basic_mv():
+def basic_mv() -> DataFrame:
    return spark.table("samples.nyctaxi.trips")
```
-Optionally, you can specify the table name using the `name` argument:
+The name of the materialized view is derived from the name of the function.
+
+You can specify the name of the materialized view using the `name` argument:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.materialized_view(name="trips_mv")
-def basic_mv():
+def basic_mv() -> DataFrame:
    return spark.table("samples.nyctaxi.trips")
```
-### Creating a Temporary View with Python
+### Creating Temporary View
-The `@dp.temporary_view` decorator tells SDP to create a temporary view based on the results returned by a function that performs a batch read:
+The `@dp.temporary_view` decorator tells SDP to create a temporary view based on the results of a function that performs a batch read:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.temporary_view
-def basic_tv():
+def basic_tv() -> DataFrame:
    return spark.table("samples.nyctaxi.trips")
```
This temporary view can be read by other queries within the pipeline, but can't be read outside the scope of the pipeline.
-### Creating a Streaming Table with Python
+### Creating Streaming Table
-Similarly, you can create a streaming table by using the `@dp.table` decorator with a function that performs a streaming read:
+You can create a streaming table using the `@dp.table` decorator with a function that performs a streaming read:
```python
from pyspark import pipelines as dp
+from pyspark.sql import DataFrame
@dp.table
-def basic_st():
+def basic_st() -> DataFrame:
    return spark.readStream.table("samples.nyctaxi.trips")
```
-### Loading Data from a Streaming Source
+### Loading Data from Streaming Source
Review Comment:
I think "from a Streaming Source" is more grammatically correct.
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -90,118 +94,136 @@ name: my_pipeline
libraries:
  - glob:
      include: transformations/**
+storage: file:///absolute/path/to/storage/dir
catalog: my_catalog
database: my_db
configuration:
  spark.sql.shuffle.partitions: "1000"
```
-It's conventional to name pipeline spec files `spark-pipeline.yml`.
-
The `spark-pipelines init` command, described below, makes it easy to generate a pipeline project with default configuration and directory structure.
-
## The `spark-pipelines` Command Line Interface
-The `spark-pipelines` command line interface (CLI) is the primary way to execute a pipeline. It also contains an `init` subcommand for generating a pipeline project and a `dry-run` subcommand for validating a pipeline.
+The `spark-pipelines` command line interface (CLI) is the primary way to manage a pipeline.
`spark-pipelines` is built on top of `spark-submit`, meaning that it supports all cluster managers supported by `spark-submit`. It supports all `spark-submit` arguments except for `--class`.
### `spark-pipelines init`
-`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named "my_pipeline", including a spec file and example definitions.
+`spark-pipelines init --name my_pipeline` generates a simple pipeline project, inside a directory named `my_pipeline`, including a spec file and example transformation definitions.
### `spark-pipelines run`
-`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes. The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for a file named `spark-pipeline.yml` or `spark-pipeline.yaml`.
+`spark-pipelines run` launches an execution of a pipeline and monitors its progress until it completes.
+
+The `--spec` parameter allows selecting the pipeline spec file. If not provided, the CLI will look in the current directory and parent directories for one of the files:
+
+* `spark-pipeline.yml`
+* `spark-pipeline.yaml`
### `spark-pipelines dry-run`
`spark-pipelines dry-run` launches an execution of a pipeline that doesn't write or read any data, but catches many kinds of errors that would be caught if the pipeline were to actually run. E.g.
- Syntax errors – e.g. invalid Python or SQL code
-- Analysis errors – e.g. selecting from a table that doesn't exist or selecting a column that doesn't exist
+- Analysis errors – e.g. selecting from a table or a column that doesn't exist
- Graph validation errors - e.g. cyclic dependencies
## Programming with SDP in Python
-SDP Python functions are defined in the `pyspark.pipelines` module. Your pipelines implemented with the Python API must import this module. It's common to alias the module to `dp` to limit the number of characters you need to type when using its APIs.
+SDP Python definitions are defined in the `pyspark.pipelines` module.
+
+Your pipelines implemented with the Python API must import this module. It's recommended to alias the module to `dp`.
```python
from pyspark import pipelines as dp
```
-### Creating a Materialized View with Python
Review Comment:
I would prefer to keep the prior heading names – yes, "Python" is redundant with the outer heading, but it makes it easier for someone scrolling through the doc to find what they're looking for. It might also help with SEO.
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -409,12 +454,15 @@ SELECT * FROM STREAM(customers_us_east);
### Python Considerations
- SDP evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view.
-- The function used to define a dataset must return a Spark DataFrame.
+- The function used to define a dataset must return a Spark `pyspark.sql.DataFrame`.
Review Comment:
```suggestion
- The function used to define a dataset must return a `pyspark.sql.DataFrame`.
```
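   Side note for anyone reading along: a minimal sketch of what this bullet means in practice, reusing the table and decorator from the examples earlier in the guide. The function name and the `limit` are made up, and `spark` is assumed to be available in definition files the same way the guide's other snippets use it:

   ```python
   from pyspark import pipelines as dp
   from pyspark.sql import DataFrame

   @dp.materialized_view
   def trips_sample() -> DataFrame:
       # Build and return the DataFrame; SDP materializes the result itself,
       # so no .write()/.saveAsTable() calls belong in the function body.
       return spark.table("samples.nyctaxi.trips").limit(1000)
   ```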
##########
docs/declarative-pipelines-programming-guide.md:
##########
@@ -409,12 +454,15 @@ SELECT * FROM STREAM(customers_us_east);
### Python Considerations
- SDP evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view.
-- The function used to define a dataset must return a Spark DataFrame.
+- The function used to define a dataset must return a Spark `pyspark.sql.DataFrame`.
- Never use methods that save or write to files or tables as part of your SDP dataset code.
+- When using the `for` loop pattern to define datasets in Python, ensure that the list of values passed to the `for` loop is always additive.
Review Comment:
I think it might be clearer to just leave this bullet out – we don't discuss a `for` loop pattern above.
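   For context, the `for` loop pattern that sentence seems to refer to is something like the rough sketch below; the `regions` list, the view names, and the table names are made up for illustration, not taken from the guide:

   ```python
   from pyspark import pipelines as dp
   from pyspark.sql import DataFrame

   # Hypothetical list of suffixes; the bullet's "additive" advice presumably
   # means entries should only ever be appended, never removed or renamed.
   regions = ["us_east", "us_west", "eu_west"]

   for region in regions:
       @dp.materialized_view(name=f"trips_{region}")
       def trips_for_region(region: str = region) -> DataFrame:
           # Bind the loop variable through a default argument so each view
           # reads its own region rather than the loop's final value.
           return spark.table(f"samples.nyctaxi.trips_{region}")
   ```

   If we do want to keep the bullet, a snippet along these lines above it would give it the missing context.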