Your loop decides which files to process and then unions the data on each
iteration. If you instead load all the files at once and let Spark sort it
out, it should be much faster.

Untested:

 val rdd = sc.textFile("medline15n00*.xml")
 val words = rdd.flatMap( x => x.split(" ") )
 val counts = words.map( x => (x, 1) ).reduceByKey( (x, y) => x + y )
 counts.saveAsTextFile("results")
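
In case it helps to see what the reduceByKey step computes, here is the same
word-count logic on plain Scala collections, with no Spark involved. The
WordCount object and countWords name are just illustrative, not anything from
your program:

```scala
object WordCount {
  // Same shape as flatMap + map + reduceByKey((x, y) => x + y):
  // split lines into words, pair each with 1, then sum the 1s per key.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(w => (w, 1))
      .groupMapReduce(_._1)(_._2)(_ + _) // Scala 2.13+: group by word, sum counts

  def main(args: Array[String]): Unit = {
    println(countWords(Seq("a b a", "b c"))) // Map(a -> 2, b -> 2, c -> 1)
  }
}
```

Doing one reduceByKey over a single RDD that covers all the files is exactly
this, distributed; the per-file reduce plus repeated union just adds overhead.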



shawn.c.carr...@gmail.com
Software Engineer
Soccer Referee

On Wed, Sep 16, 2015 at 2:07 PM, huajun <huajun.w...@gmail.com> wrote:

> Hi.
> I have a problem with this very simple word count program. The program works
> fine for
> thousands of similar files in the dataset but is very slow for these first
> 28 or so.
> The files are about 50 to 100 MB each
> and the program process other similar 28 files in about 30sec. These first
> 28 files, however, take 30min.
> This should not be a problem with the data in these files, as if I combine
> all the files into one
> bigger file, it will be processed in about 30sec.
>
> I am running spark in local mode (with > 100GB memory) and it just uses
> 100% CPU (one core) most of the time (for this troubled case), and no
> network traffic is involved.
>
> Any obvious (or non-obvious) errors?
>
>     def process(file : String) : RDD[(String, Int)] = {
>       val rdd = sc.textFile(file)
>       val words = rdd.flatMap( x=> x.split(" ") );
>
>       words.map( x=> (x,1)).reduceByKey( (x,y) => (x+y) )
>     }
>
>     val file = "medline15n0001.xml"
>     var keep = process(file)
>
>     for (i <- 2 to 28) {
>       val file = if (i < 10) "medline15n000" + i + ".xml"
>                  else "medline15n00" + i + ".xml"
>
>       val result = process(file)
>       keep = result.union(keep);
>     }
>     keep = keep.reduceByKey( (x,y) => (x+y) )
>     keep.saveAsTextFile("results")
>
> Thanks.
>
>
>
>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
