GitHub user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/16856#discussion_r104256037
--- Diff: docs/quick-start.md ---
@@ -137,37 +138,24 @@ res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1),
<div data-lang="python" markdown="1">
{% highlight python %}
->>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
-15
+>>> from pyspark.sql.functions import *
+>>> textFile.select(size(split(textFile.value, "\s+")).name("numWords")).agg(max(col("numWords"))).collect()
+[Row(max(numWords)=15)]
{% endhighlight %}
-This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to `map` and `reduce` are Python [anonymous functions (lambdas)](https://docs.python.org/2/reference/expressions.html#lambda),
-but we can also pass any top-level Python function we want.
-For example, we'll define a `max` function to make this code easier to understand:
-
-{% highlight python %}
->>> def max(a, b):
-...     if a > b:
-...         return a
-...     else:
-...         return b
-...
-
->>> textFile.map(lambda line: len(line.split())).reduce(max)
-15
-{% endhighlight %}
+This first maps a line to an integer value and alias it with "numWords", creating a new DataFrame. `agg` is called on that DataFrame to find the largest word count. The arguments to `select` and `agg` are both _[Column](api/python/index.html#pyspark.sql.Column)_, we can use `df.colName` to get a column from a DataFrame, we can also import pyspark.sql.functions, which provides a lot of convenient functions to build a new Column from an old one.
--- End diff --
alias it with "numWords" -> aliases it as "numWords"
Should the clauses beginning ", we can" be separate sentences?
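
For anyone following the diff without a Spark session, here is a rough plain-Python sketch of what the removed and added examples compute. It is not PySpark code: the sample lines are hypothetical, and it assumes that `re.split` is a close enough stand-in for `split(value, "\s+")` in `pyspark.sql.functions`.

```python
import re
from functools import reduce

# Hypothetical sample lines; the real `textFile` in the quick start reads Spark's README.
LINES = [
    "Apache Spark",
    "Spark is a fast and general cluster computing system",
    "",
]


def words_rdd_style(line):
    # Mirrors len(line.split()) from the removed RDD example.
    return len(line.split())


def words_df_style(line):
    # Approximates size(split(value, "\s+")) from the added DataFrame example,
    # under the assumption that re.split behaves like Spark's split here.
    return len(re.split(r"\s+", line))


# Removed example: map each line to a word count, then reduce with a pairwise max.
max_rdd = reduce(lambda a, b: a if a > b else b, map(words_rdd_style, LINES))

# Added example: derive a "numWords" value per row, then aggregate with max.
max_df = max(words_df_style(line) for line in LINES)

print(max_rdd, max_df)                       # 9 9
print([words_rdd_style(l) for l in LINES])   # [2, 9, 0]
print([words_df_style(l) for l in LINES])    # [2, 9, 1]
```

In this sketch both versions agree on the maximum, but the per-line counts differ for the empty line (0 versus 1), because a regex split of an empty string yields one empty token. Whether that matters for the documented output of 15 depends on the actual README contents.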