Hi. I have a problem with this very simple word count program. The program works fine for thousands of similar files in the dataset but is very slow for the first 28 or so. The files are about 50 to 100 MB each, and the program processes other batches of 28 similar files in about 30 seconds. These first 28 files, however, take 30 minutes. The data in these files does not seem to be the problem: if I concatenate all 28 into one bigger file, it is processed in about 30 seconds.
I am running Spark in local mode (with > 100 GB of memory), and for this troubled case it uses 100% of one CPU core most of the time; no network traffic is involved. Any obvious (or non-obvious) errors?

def process(file: String): RDD[(String, Int)] = {
  val rdd = sc.textFile(file)
  val words = rdd.flatMap(x => x.split(" "))
  words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
}

val file = "medline15n0001.xml"
var keep = process(file)
for (i <- 2 to 28) {
  val file = if (i < 10) "medline15n000" + i + ".xml"
             else "medline15n00" + i + ".xml"
  val result = process(file)
  keep = result.union(keep)
}
keep = keep.reduceByKey((x, y) => x + y)
keep.saveAsTextFile("results")

Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-with-a-very-simple-word-count-program-tp24715.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
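For comparison, here is a minimal sketch of the same word count that reads all 28 files through a single textFile call (Spark's textFile accepts a comma-separated list of paths), so one RDD is built instead of a chain of 28 unions. This mirrors the observation above that the concatenated file is fast; the f-interpolator padding is just my shorthand for the if/else filename logic and otherwise the snippet assumes the same SparkContext `sc` as the original:

```scala
// Build the 28 paths with zero-padded numbers: medline15n0001.xml .. medline15n0028.xml
val paths = (1 to 28).map(i => f"medline15n$i%04d.xml").mkString(",")

sc.textFile(paths)           // one RDD spanning all 28 files
  .flatMap(_.split(" "))     // split lines into words
  .map((_, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)        // single global reduce instead of 28 per-file reduces
  .saveAsTextFile("results")
```

This is only a sketch of an alternative shape for the job, not a claimed fix for the slowdown.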