Hi.
I have a problem with this very simple word-count program. The program works
fine for thousands of similar files in the dataset but is very slow for these
first 28 or so. The files are about 50 to 100 MB each, and the program
processes 28 other similar files in about 30 seconds. These first 28 files,
however, take 30 minutes.
This should not be a problem with the data in these files: if I combine all
the files into one bigger file, it is processed in about 30 seconds.
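
For reference, the combined test can be reproduced without manually
concatenating the files, since textFile accepts a comma-separated list of
paths. A sketch, assuming the files sit in the working directory:

    // Read all 28 files as a single RDD in one textFile call; the path
    // argument accepts a comma-separated list (or a glob such as
    // "medline15n00*.xml").
    val paths = (1 to 28).map(i => f"medline15n$i%04d.xml").mkString(",")
    val combined = sc.textFile(paths)
    combined.flatMap(x => x.split(" "))
            .map(x => (x, 1))
            .reduceByKey((x, y) => x + y)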

I am running Spark in local mode (with > 100 GB of memory). For this troubled
case it just uses 100% CPU (one core) most of the time, and no network
traffic is involved.
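
For completeness, the context is created roughly like this (a sketch; the
master string is an assumption, since it isn't shown above):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[*]" uses all available cores; plain "local" pins Spark to one.
    // Driver memory is normally set before the JVM starts, e.g. via
    //   spark-submit --driver-memory 100g ...
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local[*]")  // assumed; not shown in the original post
    val sc = new SparkContext(conf)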

Any obvious (or non-obvious) errors?

    import org.apache.spark.rdd.RDD

    // Word count for a single file: one (word, count) RDD per file.
    def process(file: String): RDD[(String, Int)] = {
      val rdd = sc.textFile(file)
      val words = rdd.flatMap(x => x.split(" "))
      words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
    }

    val file = "medline15n0001.xml"
    var keep = process(file)

    for (i <- 2 to 28) {
      // Zero-pad the file index to four digits.
      val file = if (i < 10) "medline15n000" + i + ".xml"
                 else "medline15n00" + i + ".xml"

      val result = process(file)
      keep = result.union(keep)
    }
    // Merge the per-file counts into global counts.
    keep = keep.reduceByKey((x, y) => x + y)
    keep.saveAsTextFile("results")
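
For comparison, the same counts can be built with a single SparkContext.union
over all per-file RDDs instead of 27 chained pairwise unions. A sketch
(sc.union over a Seq of RDDs is the standard API; the output path is made up):

    // Build one word-count RDD per file, then union them in one call.
    val rdds = (1 to 28).map { i =>
      val name = if (i < 10) "medline15n000" + i + ".xml"
                 else "medline15n00" + i + ".xml"
      process(name)
    }
    sc.union(rdds)
      .reduceByKey((x, y) => x + y)
      .saveAsTextFile("results-union")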

Thanks.