Min Li created SPARK-1967:
-----------------------------
Summary: Using parallelize method to create RDD, wordcount app
just hanging there without errors or warnings
Key: SPARK-1967
URL: https://issues.apache.org/jira/browse/SPARK-1967
Project: Spark
Issue Type: Bug
Affects Versions: 0.9.1
Environment: Ubuntu-12.04, single machine spark standalone, 8 core, 8G
mem, spark 0.9.1, java-1.7
Reporter: Min Li
I was trying out the parallelize method to create an RDD, using Java. It's a
simple wordcount program, except that I first read the input into memory and
then use the parallelize method to create the RDD, rather than the textFile
method used in the bundled example.
Pseudo code:
JavaSparkContext ctx = new JavaSparkContext($SparkMasterURL, $NAME, $SparkHome,
$jars);
List<String> input = ... // read lines from the input file into an ArrayList<String>
JavaRDD<String> lines = ctx.parallelize(input);
// followed by wordcount
---- the above is NOT working.
JavaRDD<String> lines = ctx.textFile(file);
// followed by wordcount
---- this IS working.
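For reference, the result the job should produce is just the standard
split-and-accumulate word count; here is a minimal plain-Java sketch of the
same logic (no Spark involved; the class name WordCountSketch and the sample
input are mine), which is what the flatMap/map/reduceByKey pipeline computes
when it does not hang:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Count occurrences of each whitespace-separated word across the lines,
    // mirroring what the Spark wordcount pipeline computes.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<String>();
        input.add("to be or");
        input.add("not to be");
        Map<String, Integer> counts = wordCount(input);
        System.out.println(counts.get("to"));  // 2
        System.out.println(counts.get("be"));  // 2
        System.out.println(counts.get("or"));  // 1
        System.out.println(counts.get("not")); // 1
    }
}
```

Feeding the same in-memory List<String> through ctx.parallelize(...) should
yield these counts, but the job stalls before the reduce stage completes.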
The log is:
14/05/29 16:18:43 INFO Slf4jLogger: Slf4jLogger started
14/05/29 16:18:43 INFO Remoting: Starting remoting
14/05/29 16:18:43 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO SparkEnv: Registering BlockManagerMaster
14/05/29 16:18:43 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20140529161843-836a
14/05/29 16:18:43 INFO MemoryStore: MemoryStore started with capacity 1056.0 MB.
14/05/29 16:18:43 INFO ConnectionManager: Bound socket to port 42942 with id =
ConnectionManagerId(spark,42942)
14/05/29 16:18:43 INFO BlockManagerMaster: Trying to register BlockManager
14/05/29 16:18:43 INFO BlockManagerMasterActor$BlockManagerInfo: Registering
block manager spark:42942 with 1056.0 MB RAM
14/05/29 16:18:43 INFO BlockManagerMaster: Registered BlockManager
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO HttpBroadcast: Broadcast server started at
http://10.227.119.185:43522
14/05/29 16:18:43 INFO SparkEnv: Registering MapOutputTracker
14/05/29 16:18:43 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-3704a621-789c-4d97-b1fc-9654236dba3e
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO SparkUI: Started Spark Web UI at http://spark:4040
14/05/29 16:18:44 INFO SparkContext: Added JAR
/home/maxmin/tmp/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar at
http://10.227.119.185:55286/jars/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar
with timestamp 1401394724045
14/05/29 16:18:44 INFO AppClient$ClientActor: Connecting to master
spark://spark:7077...
14/05/29 16:18:44 INFO SparkDeploySchedulerBackend: Connected to Spark cluster
with app ID app-20140529161844-0001
14/05/29 16:18:44 INFO AppClient$ClientActor: Executor added:
app-20140529161844-0001/0 on worker-20140529155406-spark-59658 (spark:59658)
with 8 cores
The app hangs here forever, and spark:8080 and spark:4040 show nothing
unusual. The Spark Stages page shows the active stage is reduceByKey, with
Succeeded/Total tasks at 0/2. I have also tried calling lines.count() directly
after parallelize, and the app gets stuck at the count stage instead.
I am using spark-0.9.1 with the default spark-env.sh, and the slaves file
contains only one host. I used Maven to build a fat jar with Spark marked as
provided, and I modified the run-example script to submit the jar.
--
This message was sent by Atlassian JIRA
(v6.2#6252)