viirya commented on code in PR #493:
URL: https://github.com/apache/spark-website/pull/493#discussion_r1426326048


##########
examples.md:
##########
@@ -36,36 +42,38 @@ In this page, we will show examples using RDD API as well as examples using high
 <div class="tab-pane tab-pane-python active">
 <div class="code code-tab">
 {% highlight python %}
-text_file = sc.textFile("hdfs://...")
-counts = text_file.flatMap(lambda line: line.split(" ")) \
-             .map(lambda word: (word, 1)) \
-             .reduceByKey(lambda a, b: a + b)
-counts.saveAsTextFile("hdfs://...")
+df = spark.read.text("hdfs://...").toDF("text")
+
+df.select(explode(split(col("text"), " ")).alias("word")) \
+  .groupBy("word") \
+  .agg(count(lit(0)).alias("count")) \
+  .write.parquet("hdfs://...")
 {% endhighlight %}
 </div>
 </div>
 
 <div class="tab-pane tab-pane-scala">
 <div class="code code-tab">
 {% highlight scala %}
-val textFile = sc.textFile("hdfs://...")
-val counts = textFile.flatMap(line => line.split(" "))
-                 .map(word => (word, 1))
-                 .reduceByKey(_ + _)
-counts.saveAsTextFile("hdfs://...")
+val df = spark.read.text("hdfs://...").toDF("text")
+
+df.select(explode(split(col("text"), " ")).alias("word"))
+  .groupBy("word")
+  .agg(count(lit(0)).alias("count"))
+  .write.parquet("hdfs://...")
 {% endhighlight %}
 </div>
 </div>
 
 <div class="tab-pane tab-pane-java">
 <div class="code code-tab">
 {% highlight java %}
-JavaRDD<String> textFile = sc.textFile("hdfs://...");
-JavaPairRDD<String, Integer> counts = textFile
-    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
-    .mapToPair(word -> new Tuple2<>(word, 1))
-    .reduceByKey((a, b) -> a + b);
-counts.saveAsTextFile("hdfs://...");
+DataFrame df = spark.read.text("hdfs://...").toDF("text");
+
+df.select(explode(split(col("text"), " ")).alias("word"))
+  .groupBy("word")
+  .agg(count(lit(0)).alias("count"))
+  .write.parquet("hdfs://...");

Review Comment:
   ```suggestion
   df.select(explode(split(col("text"), " ")).alias("word"))
     .groupBy("word")
     .agg(count(lit(0)).alias("count"))
     .write().parquet("hdfs://...");
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

