[GitHub] spark pull request #16856: [SPARK-19516][DOC] update public doc to use Spark...

srowen Fri, 03 Mar 2017 14:44:52 -0800

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16856#discussion_r104255711
  
    --- Diff: docs/quick-start.md ---
    @@ -29,28 +30,28 @@ or Python. Start it by running the following in the Spark directory:
     
         ./bin/spark-shell
     
    -Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:
    +Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Let's make a new Dataset from the text of the README file in the Spark source directory:
     
     {% highlight scala %}
    -scala> val textFile = sc.textFile("README.md")
    -textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
    +scala> val textFile = spark.read.textFile("README.md")
    +textFile: org.apache.spark.sql.Dataset[String] = [value: string]
     {% endhighlight %}
     
    -RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
    +You can get values from Dataset directly, by calling some actions, or transform the Dataset to get a new one. For more details, please read the _[API doc](api/scala/index.html#org.apache.spark.sql.Dataset)_.
     
     {% highlight scala %}
    -scala> textFile.count() // Number of items in this RDD
    +scala> textFile.count() // Number of items in this Dataset
     res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs
     
    -scala> textFile.first() // First item in this RDD
    +scala> textFile.first() // First item in this Dataset
     res1: String = # Apache Spark
     {% endhighlight %}
     
    -Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
    +Now let's transform this Dataset to a new one. We will call the `filter` to return a new Dataset with a subset of the items in the file.
    --- End diff --
    
    Just "call `filter`" rather than "call the `filter`"?



