GitHub user YanTangZhai commented on the pull request:

    https://github.com/apache/spark/pull/3794#issuecomment-68438167
  
    @JoshRosen Thanks for your comments. I've updated the pull request accordingly and put together a simple example:
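    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits; needed for reduceByKey before Spark 1.3

    val inputfile1 = "./testin/in_1.txt"
    val inputfile2 = "./testin/in_2.txt"
    val tempfile = "./testtmp"
    val outputfile = "./testout"
    val sc = new SparkContext(new SparkConf())

    // job1: word count over inputfile1, saved to tempfile as "word,count" lines
    sc.textFile(inputfile1)
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 1)
      .map { case (word, count) => s"$word,$count" }
      .saveAsTextFile(tempfile)

    // job2: union job1's output with inputfile2, then re-aggregate the counts.
    // Both inputs are parsed as "word,count" lines.
    val wordCounts1 = sc.textFile(tempfile)
    val wordCounts2 = sc.textFile(inputfile2)
    val wordCounts = wordCounts1.union(wordCounts2)
    wordCounts
      .map { line =>
        val kv = line.split(",")
        (kv(0), kv(1).toInt)
      }
      .reduceByKey(_ + _, 1)
      .map { case (word, count) => s"$word,$count" }
      .saveAsTextFile(outputfile)
    ```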
    ./testin/in_1.txt (23 bytes) and ./testin/in_2.txt (19 bytes) are both local files.
    - Before optimization:
     - job1
       <br/>New stage creation took 0.729638 s, of which HadoopRDD.getPartitions took 0.710247 s.
     - job2
       <br/>New stage creation took 0.882241 s, of which HadoopRDD.getPartitions took 0.850668 + 0.023490 s.
    - After optimization:
     - job1
       <br/>HadoopRDD.getPartitions took 0.802133 s.
       <br/>New stage creation took 0.029328 s.
     - job2
       <br/>HadoopRDD.getPartitions took 0.464713 + 0.022568 s.
       <br/>New stage creation took 0.001773 s.
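
    As a rough reading of the "before" figures, the sketch below subtracts the HadoopRDD.getPartitions time from each stage-creation time to show how little was spent on the rest of stage creation. The `StageCreationBreakdown` object and `Job` case class are hypothetical names used only for illustration. The numbers are copied from the list above, and the sketch assumes the reported getPartitions time was fully contained in the stage-creation time.

    ```scala
    object StageCreationBreakdown {
      // One job's "before optimization" timings, in seconds.
      final case class Job(name: String,
                           stageCreation: BigDecimal,
                           getPartitions: Seq[BigDecimal])

      // Figures copied from the "Before optimization" list.
      val before: Seq[Job] = Seq(
        Job("job1", BigDecimal("0.729638"), Seq(BigDecimal("0.710247"))),
        Job("job2", BigDecimal("0.882241"),
          Seq(BigDecimal("0.850668"), BigDecimal("0.023490")))
      )

      // Stage-creation time left over once getPartitions is excluded.
      def residual(job: Job): BigDecimal =
        job.stageCreation - job.getPartitions.sum

      def main(args: Array[String]): Unit =
        before.foreach { j =>
          println(s"${j.name}: ${residual(j)} s outside getPartitions")
        }
    }
    ```

    With these inputs the remainder is 0.019391 s for job1 and 0.008083 s for job2, so nearly all of the "before" stage-creation time was spent in getPartitions.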

