Hi Zhan,

Thanks for looking into this. I'm actually using the hash map as an example; it's the simplest snippet of code that fails for me, and I know that on its own it's just word count. In my actual problem I'm using a Trie data structure to find substring matches.
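In case it helps to see it concretely: the ~28 MB "Serialized task" size in the logs below suggests the Trie is being captured in each task's closure, so it gets reserialized per task. A broadcast variable ships it to each executor once instead. A rough sketch of what I mean, where `Trie` stands in for whatever structure is built on the driver (it just needs to be Serializable), and the file paths and method names (`insert`, `mentions`) are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TrieMatchJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("trie-match"))

    // Build the (large) structure once on the driver.
    val trie = new Trie()
    sc.textFile("locations.txt").collect().foreach(trie.insert)

    // Broadcast ships the trie to each executor once, instead of
    // serializing it into every task closure.
    val trieBc = sc.broadcast(trie)

    // Keep only the lines that mention a known location.
    val hits = sc.textFile("documents.txt")
      .filter(line => trieBc.value.mentions(line).nonEmpty)

    // Write results out instead of collecting them to the driver.
    hits.saveAsTextFile("matches")
    sc.stop()
  }
}
```

This is only a sketch under the assumption that the closure capture is what's blowing up the task size; if the broadcast itself stalls, that would point elsewhere.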
On Sun, Aug 17, 2014 at 11:35 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> Is it because countByValue or toArray put too much stress on the driver,
> if there are many unique words?
>
> To me it is a typical word count problem, so you can solve it as follows
> (correct me if I am wrong):
>
> val textFile = sc.textFile("file")
> val counts = textFile.flatMap(line => line.split(" "))
>   .map(word => (word, 1))
>   .reduceByKey((a, b) => a + b)
> counts.saveAsTextFile("file") // anyway, you don't want to collect results
>                               // to the master; put them in a file instead.
>
> Thanks.
>
> Zhan Zhang
>
> On Aug 16, 2014, at 9:18 AM, Jerry Ye <jerr...@gmail.com> wrote:
>
> > The job ended up running overnight with no progress. :-(
> >
> > On Sat, Aug 16, 2014 at 12:16 AM, Jerry Ye <jerr...@gmail.com> wrote:
> >
> >> Hi Xiangrui,
> >> I actually tried branch-1.1 and master and it resulted in the job being
> >> stuck at the TaskSetManager:
> >>
> >> 14/08/16 06:55:48 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
> >> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 2 on executor 8: ip-10-226-199-225.us-west-2.compute.internal (PROCESS_LOCAL)
> >> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 28055875 bytes in 162 ms
> >> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID 3 on executor 0: ip-10-249-53-62.us-west-2.compute.internal (PROCESS_LOCAL)
> >> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as 28055875 bytes in 178 ms
> >>
> >> It's been 10 minutes with no progress on relatively small data. I'll let
> >> it run overnight and update in the morning. Is there some place that I
> >> should look to see what is happening? I tried to ssh into the executor and
> >> look at /root/spark/logs but there wasn't anything informative there.
> >> I'm sure using countByValue works fine, but my use of a HashMap is only an
> >> example. In my actual task, I'm loading a Trie data structure to perform
> >> efficient string matching between a dataset of locations and strings
> >> possibly containing mentions of locations.
> >>
> >> This seems like a common thing: processing input with a relatively
> >> memory-intensive object like a Trie. I hope I'm not missing something
> >> obvious. Do you know of any example code like my use case?
> >>
> >> Thanks!
> >>
> >> - jerry
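For reference, the kind of Trie matching I mean looks roughly like this. It is a minimal self-contained sketch (no Spark, details simplified): insert the location names, then try a match starting at every position of the input text.

```scala
import scala.collection.mutable

class Trie {
  private class Node {
    val children = mutable.Map[Char, Node]()
    var terminal = false // true if a complete inserted word ends here
  }
  private val root = new Node

  def insert(word: String): Unit = {
    var node = root
    for (c <- word) node = node.children.getOrElseUpdate(c, new Node)
    node.terminal = true
  }

  // Longest inserted word starting at text(start), if any.
  private def matchAt(text: String, start: Int): Option[String] = {
    var node = root
    var i = start
    var best: Option[String] = None
    while (i < text.length && node.children.contains(text(i))) {
      node = node.children(text(i))
      i += 1
      if (node.terminal) best = Some(text.substring(start, i))
    }
    best
  }

  // All mentions of inserted words anywhere in the text.
  def mentions(text: String): Seq[String] =
    (0 until text.length).flatMap(i => matchAt(text, i))
}

// Example:
//   val t = new Trie
//   Seq("new york", "york").foreach(t.insert)
//   t.mentions("i moved to new york") // contains "new york" and "york"
```

Scanning every start position is O(text length × longest pattern); an Aho-Corasick automaton would do better, but this is the shape of the structure I'm shipping to the executors.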