This is an automated email from the ASF dual-hosted git repository.
bhulette pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new fdf0636 Minor: Add more links to DataFrame API documentation (#15661)
fdf0636 is described below
commit fdf06361cd609335dc6c9763fb09f4e6b3e29e36
Author: Brian Hulette <[email protected]>
AuthorDate: Fri Oct 15 09:32:34 2021 -0700
Minor: Add more links to DataFrame API documentation (#15661)
* Add links to API documentation
* Pandas -> pandas, explicitly link to DataFrame examples
* Drop 'standard'
---
.../en/documentation/dsls/dataframes/overview.md | 21 +++++++++++++--------
.../site/layouts/partials/section-menu/en/sdks.html | 5 ++++-
2 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/website/www/site/content/en/documentation/dsls/dataframes/overview.md b/website/www/site/content/en/documentation/dsls/dataframes/overview.md
index 0620da4..c2c9f8f 100644
--- a/website/www/site/content/en/documentation/dsls/dataframes/overview.md
+++ b/website/www/site/content/en/documentation/dsls/dataframes/overview.md
@@ -54,15 +54,15 @@ with beam.Pipeline() as p:
pandas is able to infer column names from the first row of the CSV data, which is where `passenger_count` and `DOLocationID` come from.
-In this example, the only traditional Beam type is the `Pipeline` instance. Otherwise the example is written completely with the DataFrame API. This is possible because the Beam DataFrame API includes its own IO operations (for example, `read_csv` and `to_csv`) based on the pandas native implementations. `read_*` and `to_*` operations support file patterns and any Beam-compatible file system. The grouping is accomplished with a group-by-key, and arbitrary pandas operations (in this case, [...]
+In this example, the only traditional Beam type is the `Pipeline` instance. Otherwise the example is written completely with the DataFrame API. This is possible because the Beam DataFrame API includes its own IO operations (for example, [`read_csv`][pydoc_read_csv] and [`to_csv`][pydoc_to_csv]) based on the pandas native implementations. `read_*` and `to_*` operations support file patterns and any Beam-compatible file system. The grouping is accomplished with a group-by-key, and arbitrar [...]
-The Beam DataFrame API aims to be compatible with the native pandas implementation, with a few caveats detailed below in [Differences from standard pandas](/documentation/dsls/dataframes/differences-from-pandas/).
+The Beam DataFrame API aims to be compatible with the native pandas implementation, with a few caveats detailed below in [Differences from pandas](/documentation/dsls/dataframes/differences-from-pandas/).
## Embedding DataFrames in a pipeline
To use the DataFrames API in a larger pipeline, you can convert a PCollection to a DataFrame, process the DataFrame, and then convert the DataFrame back to a PCollection. In order to convert a PCollection to a DataFrame and back, you have to use PCollections that have [schemas](https://beam.apache.org/documentation/programming-guide/#what-is-a-schema) attached. A PCollection with a schema attached is also referred to as a *schema-aware PCollection*. To learn more about attaching a schema [...]
-Here’s an example that creates a schema-aware PCollection, converts it to a DataFrame using `to_dataframe`, processes the DataFrame, and then converts the DataFrame back to a PCollection using `to_pcollection`:
+Here’s an example that creates a schema-aware PCollection, converts it to a DataFrame using [`to_dataframe`][pydoc_to_dataframe], processes the DataFrame, and then converts the DataFrame back to a PCollection using [`to_pcollection`][pydoc_to_pcollection]:
<!-- TODO(BEAM-11480): Convert these examples to snippets -->
{{< highlight py >}}
@@ -96,7 +96,7 @@ You can find the full wordcount example on
You can find the full wordcount example on [GitHub](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/wordcount.py), along with other [example DataFrame pipelines](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/).
-It’s also possible to use the DataFrame API by passing a function to [`DataframeTransform`][pydoc_dataframe_transform]:
+It’s also possible to use the DataFrame API by passing a function to [`DataframeTransform`][pydoc_DataframeTransform]:
{{< highlight py >}}
from apache_beam.dataframe.transforms import DataframeTransform
@@ -110,9 +110,9 @@ with beam.Pipeline() as p:
...
{{< /highlight >}}
-[`DataframeTransform`][pydoc_dataframe_transform] is similar to [`SqlTransform`][pydoc_sql_transform] from the [Beam SQL](https://beam.apache.org/documentation/dsls/sql/overview/) DSL. Where `SqlTransform` translates a SQL query to a PTransform, `DataframeTransform` is a PTransform that applies a function that takes and returns DataFrames. A `DataframeTransform` can be particularly useful if you have a stand-alone function that can be called both on Beam and on ordinary pandas DataFrames.
+[`DataframeTransform`][pydoc_DataframeTransform] is similar to [`SqlTransform`][pydoc_SqlTransform] from the [Beam SQL](https://beam.apache.org/documentation/dsls/sql/overview/) DSL. Where [`SqlTransform`][pydoc_SqlTransform] translates a SQL query to a PTransform, [`DataframeTransform`][pydoc_DataframeTransform] is a PTransform that applies a function that takes and returns DataFrames. A [`DataframeTransform`][pydoc_DataframeTransform] can be particularly useful if you have a stand-alon [...]
-`DataframeTransform` can accept and return multiple PCollections by name and by keyword, as shown in the following examples:
+[`DataframeTransform`][pydoc_DataframeTransform] can accept and return multiple PCollections by name and by keyword, as shown in the following examples:
{{< highlight py >}}
output = (pc1, pc2) | DataframeTransform(lambda df1, df2: ...)
@@ -124,7 +124,12 @@ pc1, pc2 = {'a': pc} | DataframeTransform(lambda a: expr1, expr2)
{...} = {a: pc} | DataframeTransform(lambda a: {...})
{{< /highlight >}}
-[pydoc_dataframe_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform
-[pydoc_sql_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform
+[pydoc_read_csv]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv
+[pydoc_to_csv]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrame.to_csv
+[pydoc_sum]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrame.sum
+[pydoc_DataframeTransform]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform
+[pydoc_SqlTransform]: https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform
+[pydoc_to_dataframe]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe
+[pydoc_to_pcollection]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection
{{< button-colab url="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/dataframes.ipynb" >}}
diff --git a/website/www/site/layouts/partials/section-menu/en/sdks.html b/website/www/site/layouts/partials/section-menu/en/sdks.html
index dbcc3c2..d46e05d 100644
--- a/website/www/site/layouts/partials/section-menu/en/sdks.html
+++ b/website/www/site/layouts/partials/section-menu/en/sdks.html
@@ -112,7 +112,10 @@
<span class="section-nav-list-title">DataFrames</span>
<ul class="section-nav-list">
<li><a href="/documentation/dsls/dataframes/overview/">Overview</a></li>
- <li><a href="/documentation/dsls/dataframes/differences-from-pandas/">Differences from Pandas</a></li>
+ <li><a href="/documentation/dsls/dataframes/differences-from-pandas/">Differences from pandas</a></li>
+ <li><a href="https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/dataframe" target="_blank">
+ Example pipelines <img src="/images/external-link-icon.png" width="14" height="14" alt="External link."></a>
+ </li>
<li><a href="https://beam.apache.org/releases/pydoc/{{.Site.Params.release_latest }}/apache_beam.dataframe.html" target="_blank">
DataFrame API reference <img src="/images/external-link-icon.png" width="14" height="14" alt="External link."></a>
</li>