This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 985b91c Publishing website 2021/12/08 00:03:29 at commit c1a898f
985b91c is described below
commit 985b91c598107da0cf1110552abdf9952c020ae7
Author: jenkins <[email protected]>
AuthorDate: Wed Dec 8 00:03:29 2021 +0000
Publishing website 2021/12/08 00:03:29 at commit c1a898f
---
.../dsls/dataframes/overview/index.html | 75 ++++++++++++----------
website/generated-content/sitemap.xml | 2 +-
2 files changed, 42 insertions(+), 35 deletions(-)
diff --git a/website/generated-content/documentation/dsls/dataframes/overview/index.html b/website/generated-content/documentation/dsls/dataframes/overview/index.html
index 71d2550..aeda749 100644
--- a/website/generated-content/documentation/dsls/dataframes/overview/index.html
+++ b/website/generated-content/documentation/dsls/dataframes/overview/index.html
@@ -20,52 +20,59 @@ function endSearch(){var search=document.querySelector(".searchBar");search.clas
function blockScroll(){$("body").toggleClass("fixedPosition");}
function openMenu(){addPlaceholder();blockScroll();}</script><div
class="clearfix container-main-content"><div class="section-nav closed"
data-offset-top=90 data-offset-bottom=500><span class="section-nav-back
glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list
data-section-nav><li><span
class=section-nav-list-main-title>Languages</span></li><li><span
class=section-nav-list-title>Java</span><ul class=section-nav-list><li><a
href=/documentation/sdks/java/>Java SDK overvi [...]
Run in Colab</a></td></table><p><br><br><br><br></p><p>The Apache Beam Python
SDK provides a DataFrame API for working with pandas-like <a
href=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>DataFrame</a>
objects. The feature lets you convert a PCollection to a DataFrame and then
interact with the DataFrame using the standard methods available on the pandas
DataFrame API. The DataFrame API is built on top of the pandas implementation,
and pandas DataFram [...]
-</code></pre><p>Note that the <em>same</em> <code>pandas</code> version should
be installed on workers when executing DataFrame API pipelines on distributed
runners. Reference <a
href=https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt><code>base_image_requirements.txt</code></a>
for the Beam release you are using to see what version of <code>pandas</code>
will be used by default on workers.</p><h2 id=using-dataframes>Using
DataFrames</h2><p>You c [...]
+</code></pre><p>Note that the <em>same</em> <code>pandas</code> version should
be installed on workers when executing DataFrame API pipelines on distributed
runners. Reference <a
href=https://github.com/apache/beam/blob/master/sdks/python/container/py38/base_image_requirements.txt><code>base_image_requirements.txt</code></a>
for the Python version and Beam release you are using to see what version of
<code>pandas</code> will be used by default on workers.</p><h2
id=using-dataframes>Using [...]
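To make the version requirement above concrete, here is a minimal sketch (not part of the Beam docs) of failing fast at pipeline-construction time when the local pandas differs from what the workers will use; `EXPECTED_WORKER_PANDAS` is a hypothetical placeholder you would fill in from the `base_image_requirements.txt` matching your Beam release and Python version:

```python
import pandas as pd

# Hypothetical pinned version; copy the real value from the
# base_image_requirements.txt for your Beam release and Python version.
EXPECTED_WORKER_PANDAS = pd.__version__

# Raise before submitting the pipeline if the versions disagree.
if pd.__version__ != EXPECTED_WORKER_PANDAS:
    raise RuntimeError(
        f"local pandas {pd.__version__} does not match "
        f"worker pandas {EXPECTED_WORKER_PANDAS}")
```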
-<span class=k>with</span> <span class=n>pipeline</span> <span
class=k>as</span> <span class=n>p</span><span class=p>:</span>
- <span class=n>rides</span> <span class=o>=</span> <span class=n>p</span>
<span class=o>|</span> <span class=n>read_csv</span><span class=p>(</span><span
class=n>input_path</span><span class=p>)</span>
+with pipeline as p:
+ rides = p | read_csv(input_path)
- <span class=c1># Count the number of passengers dropped off per
LocationID</span>
- <span class=n>agg</span> <span class=o>=</span> <span
class=n>rides</span><span class=o>.</span><span class=n>groupby</span><span
class=p>(</span><span class=s1>'DOLocationID'</span><span
class=p>)</span><span class=o>.</span><span class=n>passenger_count</span><span
class=o>.</span><span class=n>sum</span><span class=p>()</span>
- <span class=n>agg</span><span class=o>.</span><span
class=n>to_csv</span><span class=p>(</span><span
class=n>output_path</span><span
class=p>)</span></code></pre></div></div></div><p>pandas is able to infer
column names from the first row of the CSV data, which is where
<code>passenger_count</code> and <code>DOLocationID</code> come from.</p><p>In
this example, the only traditional Beam type is the <code>Pipeline</code>
instance. Otherwise the example is written completely with the Dat [...]
-<span class=kn>from</span> <span class=nn>apache_beam.dataframe.convert</span>
<span class=kn>import</span> <span class=n>to_pcollection</span>
-<span class=o>...</span>
+ # Count the number of passengers dropped off per LocationID
+ agg = rides.groupby('DOLocationID').passenger_count.sum()
+ agg.to_csv(output_path)
+</code></pre><p>pandas is able to infer column names from the first row of the
CSV data, which is where <code>passenger_count</code> and
<code>DOLocationID</code> come from.</p><p>In this example, the only
traditional Beam type is the <code>Pipeline</code> instance. Otherwise the
example is written completely with the DataFrame API. This is possible because
the Beam DataFrame API includes its own IO operations (for example, <a
href=https://beam.apache.org/releases/pydoc/current/apache_be [...]
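Because the deferred DataFrame mirrors pandas semantics, the `groupby` in the pipeline above behaves exactly as it would in plain pandas. As an illustration outside Beam, evaluated eagerly on a hypothetical in-memory sample with the same column names:

```python
import pandas as pd

# Tiny made-up sample using the taxi-ride schema from the example.
rides = pd.DataFrame({
    'DOLocationID':    [33, 33, 41],
    'passenger_count': [2, 3, 1],
})

# The same expression the pipeline defers, run eagerly by plain pandas.
agg = rides.groupby('DOLocationID').passenger_count.sum()
print(agg.to_dict())  # {33: 5, 41: 1}
```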
+from apache_beam.dataframe.convert import to_pcollection
+...
- <span class=c1># Read the text file[pattern] into a PCollection.</span>
- <span class=n>lines</span> <span class=o>=</span> <span class=n>p</span>
<span class=o>|</span> <span class=s1>'Read'</span> <span
class=o>>></span> <span class=n>ReadFromText</span><span
class=p>(</span><span class=n>known_args</span><span class=o>.</span><span
class=n>input</span><span class=p>)</span>
- <span class=n>words</span> <span class=o>=</span> <span class=p>(</span>
- <span class=n>lines</span>
- <span class=o>|</span> <span class=s1>'Split'</span> <span
class=o>>></span> <span class=n>beam</span><span class=o>.</span><span
class=n>FlatMap</span><span class=p>(</span>
- <span class=k>lambda</span> <span class=n>line</span><span
class=p>:</span> <span class=n>re</span><span class=o>.</span><span
class=n>findall</span><span class=p>(</span><span class=sa>r</span><span
class=s1>'[\w]+'</span><span class=p>,</span> <span
class=n>line</span><span class=p>))</span><span class=o>.</span><span
class=n>with_output_types</span><span class=p>(</span><span
class=nb>str</span><span class=p>)</span>
- <span class=c1># Map to Row objects to generate a schema suitable for
conversion</span>
- <span class=c1># to a dataframe.</span>
- <span class=o>|</span> <span class=s1>'ToRows'</span> <span
class=o>>></span> <span class=n>beam</span><span class=o>.</span><span
class=n>Map</span><span class=p>(</span><span class=k>lambda</span> <span
class=n>word</span><span class=p>:</span> <span class=n>beam</span><span
class=o>.</span><span class=n>Row</span><span class=p>(</span><span
class=n>word</span><span class=o>=</span><span class=n>word</span><span
class=p>)))</span>
+ # Read the text file[pattern] into a PCollection.
+ lines = p | 'Read' >> ReadFromText(known_args.input)
- <span class=n>df</span> <span class=o>=</span> <span
class=n>to_dataframe</span><span class=p>(</span><span
class=n>words</span><span class=p>)</span>
- <span class=n>df</span><span class=p>[</span><span
class=s1>'count'</span><span class=p>]</span> <span class=o>=</span>
<span class=mi>1</span>
- <span class=n>counted</span> <span class=o>=</span> <span
class=n>df</span><span class=o>.</span><span class=n>groupby</span><span
class=p>(</span><span class=s1>'word'</span><span class=p>)</span><span
class=o>.</span><span class=n>sum</span><span class=p>()</span>
- <span class=n>counted</span><span class=o>.</span><span
class=n>to_csv</span><span class=p>(</span><span class=n>known_args</span><span
class=o>.</span><span class=n>output</span><span class=p>)</span>
+ words = (
+ lines
+ | 'Split' >> beam.FlatMap(
+ lambda line: re.findall(r'[\w]+', line)).with_output_types(str)
+ # Map to Row objects to generate a schema suitable for conversion
+ # to a dataframe.
+ | 'ToRows' >> beam.Map(lambda word: beam.Row(word=word)))
- <span class=c1># Deferred DataFrames can also be converted back to
schema'd PCollections</span>
- <span class=n>counted_pc</span> <span class=o>=</span> <span
class=n>to_pcollection</span><span class=p>(</span><span
class=n>counted</span><span class=p>,</span> <span
class=n>include_indexes</span><span class=o>=</span><span
class=bp>True</span><span class=p>)</span></code></pre></div></div></div><p>You
can find the full wordcount example on
+ df = to_dataframe(words)
+ df['count'] = 1
+ counted = df.groupby('word').sum()
+ counted.to_csv(known_args.output)
+
+ # Deferred DataFrames can also be converted back to schema'd PCollections
+ counted_pc = to_pcollection(counted, include_indexes=True)
+
+
+</code></pre><p>You can find the full wordcount example on
<a
href=https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/wordcount.py>GitHub</a>,
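The DataFrame half of the wordcount above is again ordinary pandas. A standalone sketch of the same counting logic on one in-memory line (illustrative only; the real example runs this deferred, as a Beam pipeline):

```python
import pandas as pd

lines = pd.Series(['to be or not to be'])

# Split each line into words, one row per word, mirroring what the
# example's FlatMap + Row mapping produces.
words = lines.str.findall(r'[\w]+').explode()
df = pd.DataFrame({'word': words})
df['count'] = 1
counted = df.groupby('word').sum()
print(counted['count'].to_dict())  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```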
-along with other <a
href=https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/>example
DataFrame pipelines</a>.</p><p>It’s also possible to use the DataFrame API by
passing a function to <a
href=https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform><code>DataframeTransform</code></a>:</p><div
class="language-py snippet"><div class="notebook-skip code-snippet"><a
class=copy [...]
+along with other <a
href=https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/>example
DataFrame pipelines</a>.</p><p>It’s also possible to use the DataFrame API by
passing a function to <a
href=https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform><code>DataframeTransform</code></a>:</p><pre><code>from
apache_beam.dataframe.transforms import DataframeTransform
-<span class=k>with</span> <span class=n>beam</span><span class=o>.</span><span
class=n>Pipeline</span><span class=p>()</span> <span class=k>as</span> <span
class=n>p</span><span class=p>:</span>
- <span class=o>...</span>
- <span class=o>|</span> <span class=n>beam</span><span class=o>.</span><span
class=n>Select</span><span class=p>(</span><span
class=n>DOLocationID</span><span class=o>=</span><span class=k>lambda</span>
<span class=n>line</span><span class=p>:</span> <span class=nb>int</span><span
class=p>(</span><span class=o>..</span><span class=p>),</span>
- <span class=n>passenger_count</span><span
class=o>=</span><span class=k>lambda</span> <span class=n>line</span><span
class=p>:</span> <span class=nb>int</span><span class=p>(</span><span
class=o>..</span><span class=p>))</span>
- <span class=o>|</span> <span class=n>DataframeTransform</span><span
class=p>(</span><span class=k>lambda</span> <span class=n>df</span><span
class=p>:</span> <span class=n>df</span><span class=o>.</span><span
class=n>groupby</span><span class=p>(</span><span
class=s1>'DOLocationID'</span><span class=p>)</span><span
class=o>.</span><span class=n>sum</span><span class=p>())</span>
- <span class=o>|</span> <span class=n>beam</span><span class=o>.</span><span
class=n>Map</span><span class=p>(</span><span class=k>lambda</span> <span
class=n>row</span><span class=p>:</span> <span class=n>f</span><span
class=s2>"{row.DOLocationID},{row.passenger_count}"</span><span
class=p>)</span>
- <span class=o>...</span></code></pre></div></div></div><p><a
href=https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform><code>DataframeTransform</code></a>
is similar to <a
href=https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform><code>SqlTransform</code></a>
from the <a href=https://beam.apache.org/documentation/dsls/sql/overview/>Beam
S [...]
+with beam.Pipeline() as p:
+ ...
+ | beam.Select(DOLocationID=lambda line: int(..),
+ passenger_count=lambda line: int(..))
+ | DataframeTransform(lambda df: df.groupby('DOLocationID').sum())
+ | beam.Map(lambda row: f"{row.DOLocationID},{row.passenger_count}")
+ ...
+</code></pre><p><a
href=https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform><code>DataframeTransform</code></a>
is similar to <a
href=https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform><code>SqlTransform</code></a>
from the <a href=https://beam.apache.org/documentation/dsls/sql/overview/>Beam
SQL</a> DSL. Where <a href=https://beam.apach [...]
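The callable handed to `DataframeTransform` above takes a deferred DataFrame and returns one, and its body is plain pandas. As a sketch, the same function applied eagerly to a hypothetical concrete frame:

```python
import pandas as pd

# The same aggregation the snippet passes to DataframeTransform.
def aggregate(df):
    return df.groupby('DOLocationID').sum()

# Hypothetical eager input; inside the pipeline this would be deferred.
df = pd.DataFrame({'DOLocationID':    [12, 7, 12],
                   'passenger_count': [4, 1, 1]})
result = aggregate(df)
print(result.loc[12, 'passenger_count'])  # 5
```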
-<span class=n>output</span> <span class=o>=</span> <span class=p>{</span><span
class=s1>'a'</span><span class=p>:</span> <span class=n>pc</span><span
class=p>,</span> <span class=o>...</span><span class=p>}</span> <span
class=o>|</span> <span class=n>DataframeTransform</span><span
class=p>(</span><span class=k>lambda</span> <span class=n>a</span><span
class=p>,</span> <span class=o>...</span><span class=p>:</span> <span
class=o>...</span><span class=p>)</span>
+output = {'a': pc, ...} | DataframeTransform(lambda a, ...: ...)
-<span class=n>pc1</span><span class=p>,</span> <span class=n>pc2</span> <span
class=o>=</span> <span class=p>{</span><span class=s1>'a'</span><span
class=p>:</span> <span class=n>pc</span><span class=p>}</span> <span
class=o>|</span> <span class=n>DataframeTransform</span><span
class=p>(</span><span class=k>lambda</span> <span class=n>a</span><span
class=p>:</span> <span class=n>expr1</span><span class=p>,</span> <span
class=n>expr2</span><span class=p>)</span>
+pc1, pc2 = {'a': pc} | DataframeTransform(lambda a: expr1, expr2)
-<span class=p>{</span><span class=o>...</span><span class=p>}</span> <span
class=o>=</span> <span class=p>{</span><span class=n>a</span><span
class=p>:</span> <span class=n>pc</span><span class=p>}</span> <span
class=o>|</span> <span class=n>DataframeTransform</span><span
class=p>(</span><span class=k>lambda</span> <span class=n>a</span><span
class=p>:</span> <span class=p>{</span><span class=o>...</span><span
class=p>})</span></code></pre></div></div></div><table align=left><td><a class
[...]
+{...} = {a: pc} | DataframeTransform(lambda a: {...})
+</code></pre><table align=left><td><a class=button target=_blank
href=https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/dataframes.ipynb><img
alt="Run in Colab" width=32px height=32px
src=https://github.com/googlecolab/open_in_colab/raw/master/images/icon32.png>
Run in Colab</a></td></table><p><br><br><br><br></p></div></div><footer
class=footer><div class=footer__contained><div class=footer__cols><div
class="footer__cols__col footer__cols__col__logos"><div
class=footer__cols__col__logo><img src=/images/beam_logo_circle.svg
class=footer__logo alt="Beam logo"></div><div
class=footer__cols__col__logo><img src=/images/apache_logo_circle.svg
class=footer__logo alt="Apache logo"></div></div><div class=footer-wrapper><div
class=wrapper-grid><div class [...]
<a href=http://www.apache.org>The Apache Software Foundation</a>
| <a href=/privacy_policy>Privacy Policy</a>
diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml
index 5047cde..bd28479 100644
--- a/website/generated-content/sitemap.xml
+++ b/website/generated-content/sitemap.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-12-01T21:32:04+03:00</lastmod></url><url><loc>/blog/g
[...]
\ No newline at end of file
+<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-12-01T21:32:04+03:00</lastmod></url><url><loc>/blog/g
[...]
\ No newline at end of file