Is it because countByValue or toArray puts too much stress on the driver
when there are many unique words?
To me this looks like a typical word count problem, so you could solve it
as follows (correct me if I am wrong):

val textFile = sc.textFile("file")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
// Write the results to a file rather than collecting them on the driver.
// The output path must not already exist, so it has to differ from the input.
counts.saveAsTextFile("output")
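
To sanity-check the logic without a cluster, the same pipeline can be
sketched on plain Scala collections. This is only an illustration under my
own assumptions: the LocalWordCount object, the wordCount helper, and the
sample lines are made up for this example, and groupBy stands in for
reduceByKey, which does not exist on a local Seq.

```scala
// Hypothetical local stand-in for the Spark job; no SparkContext is needed.
object LocalWordCount {
  // Split each line on spaces, pair every word with 1, then sum the 1s per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Made-up sample input, only for illustration.
    val lines = Seq("a b a", "b c")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (word, n) =>
      println(s"$word\t$n")
    }
  }
}
```

Unlike the RDD version, this builds the whole result in one JVM, so it is
only useful for checking the transformation on small samples.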

Thanks.

Zhan Zhang

On Aug 16, 2014, at 9:18 AM, Jerry Ye <jerr...@gmail.com> wrote:

> The job ended up running overnight with no progress. :-(
> 
> 
> On Sat, Aug 16, 2014 at 12:16 AM, Jerry Ye <jerr...@gmail.com> wrote:
> 
>> Hi Xiangrui,
>> I actually tried branch-1.1 and master and it resulted in the job being
>> stuck at the TaskSetManager:
>> 14/08/16 06:55:48 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
>> with 2 tasks
>> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Starting task 1.0:0 as
>> TID 2 on executor 8: ip-10-226-199-225.us-west-2.compute.internal
>> (PROCESS_LOCAL)
>> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as
>> 28055875 bytes in 162 ms
>> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Starting task 1.0:1 as
>> TID 3 on executor 0: ip-10-249-53-62.us-west-2.compute.internal
>> (PROCESS_LOCAL)
>> 14/08/16 06:55:48 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as
>> 28055875 bytes in 178 ms
>> 
>> It's been 10 minutes with no progress on relatively small data. I'll let
>> it run overnight and update in the morning. Is there some place that I
>> should look to see what is happening? I tried to ssh into the executor and
>> look at /root/spark/logs but there wasn't anything informative there.
>> 
>> I'm sure using countByValue works fine, but my use of a HashMap is only an
>> example. In my actual task, I'm loading a Trie data structure to perform
>> efficient string matching between a dataset of locations and strings
>> possibly containing mentions of locations.
>> 
>> This seems like a common thing, to process input with a relatively memory
>> intensive object like a Trie. I hope I'm not missing something obvious. Do
>> you know of any example code like my use case?
>> 
>> Thanks!
>> 
>> - jerry
>> 



