How to set executor num on spark on yarn

2014-09-15 Thread hequn cheng
Hi~ I want to set the executor number to 16, but it is very strange that
the executor cores seem to affect the executor number on Spark on YARN. I
don't know why, or how to set the executor number properly.
=
./bin/spark-submit --class com.hequn.spark.SparkJoins \
--master yarn-cluster \
--num-executors 16 \
--driver-memory 2g \
--executor-memory 10g \
--executor-cores 4 \
/home/sparkjoins-1.0-SNAPSHOT.jar

The UI shows there are 7 executors.
=
./bin/spark-submit --class com.hequn.spark.SparkJoins \
--master yarn-cluster \
--num-executors 16 \
--driver-memory 2g \
--executor-memory 10g \
--executor-cores 2 \
/home/sparkjoins-1.0-SNAPSHOT.jar

The UI shows there are 9 executors.
=
./bin/spark-submit --class com.hequn.spark.SparkJoins \
--master yarn-cluster \
--num-executors 16 \
--driver-memory 2g \
--executor-memory 10g \
--executor-cores 1 \
/home/sparkjoins-1.0-SNAPSHOT.jar

The UI shows there are 9 executors.
==
The cluster contains 16 nodes, each with 64 GB RAM.
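
A possible, unconfirmed explanation: on YARN the number of executors actually
granted is capped by what fits into each NodeManager's advertised memory and
vcores, not only by --num-executors. The rough Scala sketch below just
illustrates that fit calculation; the per-node limits used in main() are made
up and would have to be read from yarn-site.xml for this cluster.

// Rough fit check for executors per YARN node (the concrete numbers are assumptions).
object ExecutorFit {
  def executorsThatFit(nodes: Int,
                       nodeMemoryMb: Long, nodeVcores: Int,
                       executorMemoryMb: Long, executorCores: Int): Int = {
    // Each executor runs in a container of executor memory plus an overhead
    // (spark.yarn.executor.memoryOverhead, 384 MB by default around Spark 1.0/1.1).
    val overheadMb   = 384L
    val containerMb  = executorMemoryMb + overheadMb
    val perNodeByMem = (nodeMemoryMb / containerMb).toInt
    val perNodeByCpu = nodeVcores / executorCores
    nodes * math.min(perNodeByMem, perNodeByCpu)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical per-node limits; the real yarn.nodemanager.resource.* values decide this.
    val fit = executorsThatFit(nodes = 16, nodeMemoryMb = 24576, nodeVcores = 8,
                               executorMemoryMb = 10240, executorCores = 4)
    println(s"At most $fit executors fit; YARN grants min(requested, fit) = ${math.min(16, fit)}")
  }
}

If the requested 16 executors do not fit under those limits, YARN launches
fewer and leaves the rest pending, which would show up as a smaller executor
count in the UI.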


Re: Hadoop 2.3 Centralized Cache vs RDD

2014-05-16 Thread hequn cheng
I tried centralized cache step by step following the Apache Hadoop official
website, but it seems centralized cache doesn't work.
See:
http://stackoverflow.com/questions/22293358/centralized-cache-failed-in-hadoop-2-3
Has anyone gotten it to work?


2014-05-15 5:30 GMT+08:00 William Kang :

> Hi,
> Any comments or thoughts on the implications of the newly released
> centralized cache feature in Hadoop 2.3? How different is it from RDD?
>
> Many thanks.
>
>
> Cao
>


Re: Re: Re: RDD usage

2014-03-25 Thread hequn cheng
Hi~ I wrote a program to test this. A non-idempotent "compute" function
applied in foreach does change the values inside the RDD. It may look a
little crazy to do so, since modifying the RDD makes it impossible to keep
the RDD fault-tolerant in Spark :)
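
A minimal sketch of the effect, not the original test program; the Point
class, the local master, and the values are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical mutable element type, similar in spirit to AppDataPoint below.
class Point(var y: Double) extends Serializable

object MutationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mutation-sketch").setMaster("local[2]"))
    val points = sc.parallelize(1 to 4).map(i => new Point(i.toDouble)).cache()
    points.count()                      // materialize the cached partitions
    points.foreach(p => p.y = -1.0)     // mutates the cached objects in place
    println(points.map(_.y).collect().mkString(","))  // shows -1.0s, but only because of cache()
    // If a cached partition is lost and recomputed from lineage, the mutation disappears,
    // which is why this undermines Spark's fault-tolerance guarantees.
    sc.stop()
  }
}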



2014-03-25 11:11 GMT+08:00 林武康 :

>  Hi hequn, I dug into the source of Spark a bit deeper, and I got some
> ideas. Firstly, immutability is a feature of an RDD but not a hard rule;
> there are ways to change it, for example an RDD with a non-idempotent
> "compute" function, though it is really a bad design to make that function
> non-idempotent because of the uncontrollable side effects. I agree with
> Mark that foreach can modify the elements of an RDD, but we should avoid
> this because it will affect all the RDDs generated from the changed RDD and
> make the whole process inconsistent and unstable.
>
> Some rough opinions on the immutability of RDDs; a fuller discussion could
> make it clearer. Any ideas?
>  --
> From: hequn cheng 
> Sent: 2014/3/25 10:40
> To: user@spark.apache.org
> Subject: Re: Re: RDD usage
>
>  First question:
> If you save your modified RDD like this:
> points.foreach(p=>p.y = another_value).collect() or
> points.foreach(p=>p.y = another_value).saveAsTextFile(...)
> the modified RDD will be materialized and this will not use any worker's
> memory.
> If you have more transformations after the map(), Spark will pipeline all
> the transformations and build a DAG. Very little memory will be used in
> this stage and the memory will be freed soon.
> Only cache() will persist your RDD in memory for a long time.
> Second question:
> Once an RDD is created, it cannot be changed, because it is immutable. You
> can only create a new RDD from an existing RDD or from the file system.
>
>
> 2014-03-25 9:45 GMT+08:00 林武康 :
>
>>  Hi hequn, a related question: does that mean the memory usage will be
>> doubled? And furthermore, if the compute function of an RDD is not
>> idempotent, the RDD will change while the job is running, is that right?
>>  --
>> From: hequn cheng 
>> Sent: 2014/3/25 9:35
>> To: user@spark.apache.org
>> Subject: Re: RDD usage
>>
>>  points.foreach(p=>p.y = another_value) will return a new modified RDD.
>>
>>
>> 2014-03-24 18:13 GMT+08:00 Chieh-Yen :
>>
>>>  Dear all,
>>>
>>> I have a question about the usage of RDD.
>>> I implemented a class called AppDataPoint; it looks like:
>>>
>>> case class AppDataPoint(input_y : Double, input_x : Array[Double])
>>> extends Serializable {
>>>   var y : Double = input_y
>>>   var x : Array[Double] = input_x
>>>   ..
>>> }
>>> Furthermore, I created the RDD by the following function.
>>>
>>> def parsePoint(line: String): AppDataPoint = {
>>>   /* Some related works for parsing */
>>>   ..
>>> }
>>>
>>> Assume the RDD is called "points":
>>>
>>> val lines = sc.textFile(inputPath, numPartition)
>>> var points = lines.map(parsePoint _).cache()
>>>
>>> The question is that I tried to modify the values of this RDD; the
>>> operation is:
>>>
>>> points.foreach(p=>p.y = another_value)
>>>
>>> The operation works.
>>> There is no warning or error message from the system, and the results are
>>> right.
>>> I wonder whether modifying an RDD like this is a correct and actually
>>> workable design.
>>> The documentation says that RDDs are immutable; is there any suggestion?
>>>
>>> Thanks a lot.
>>>
>>> Chieh-Yen Lin
>>>
>>
>>
>


Re: Re: RDD usage

2014-03-24 Thread hequn cheng
First question:
If you save your modified RDD like this:
points.foreach(p=>p.y = another_value).collect() or
points.foreach(p=>p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized and this will not use any worker's
memory.
If you have more transformations after the map(), Spark will pipeline all
the transformations and build a DAG. Very little memory will be used in this
stage and the memory will be freed soon.
Only cache() will persist your RDD in memory for a long time.
Second question:
Once an RDD is created, it cannot be changed, because it is immutable. You
can only create a new RDD from an existing RDD or from the file system.
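
A minimal sketch of that second point, reusing the AppDataPoint shape from
the question quoted below; the input/output paths, the parsing stand-in, and
another_value are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Same shape as the AppDataPoint case class in the quoted question.
case class AppDataPoint(input_y: Double, input_x: Array[Double]) extends Serializable {
  var y: Double = input_y
  var x: Array[Double] = input_x
}

object ImmutableStyleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("immutable-style"))
    val another_value = 0.0
    val points = sc.textFile("hdfs:///input")                        // placeholder path
      .map(line => AppDataPoint(line.trim.toDouble, Array(1.0)))     // stand-in for parsePoint
    // Derive a new RDD instead of mutating points in place; the lineage stays reproducible.
    val updated = points.map(p => AppDataPoint(another_value, p.x))
    updated.saveAsTextFile("hdfs:///output")                         // materializes the new RDD
    sc.stop()
  }
}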


2014-03-25 9:45 GMT+08:00 林武康 :

>  Hi hequn, a related question: does that mean the memory usage will be
> doubled? And furthermore, if the compute function of an RDD is not
> idempotent, the RDD will change while the job is running, is that right?
>  ------
> From: hequn cheng 
> Sent: 2014/3/25 9:35
> To: user@spark.apache.org
> Subject: Re: RDD usage
>
> points.foreach(p=>p.y = another_value) will return a new modified RDD.
>
>
> 2014-03-24 18:13 GMT+08:00 Chieh-Yen :
>
>>  Dear all,
>>
>> I have a question about the usage of RDD.
>> I implemented a class called AppDataPoint; it looks like:
>>
>> case class AppDataPoint(input_y : Double, input_x : Array[Double])
>> extends Serializable {
>>   var y : Double = input_y
>>   var x : Array[Double] = input_x
>>   ..
>> }
>> Furthermore, I created the RDD by the following function.
>>
>> def parsePoint(line: String): AppDataPoint = {
>>   /* Some related works for parsing */
>>   ..
>> }
>>
>> Assume the RDD is called "points":
>>
>> val lines = sc.textFile(inputPath, numPartition)
>> var points = lines.map(parsePoint _).cache()
>>
>> The question is that I tried to modify the values of this RDD; the
>> operation is:
>>
>> points.foreach(p=>p.y = another_value)
>>
>> The operation works.
>> There is no warning or error message from the system, and the results are
>> right.
>> I wonder whether modifying an RDD like this is a correct and actually
>> workable design.
>> The documentation says that RDDs are immutable; is there any suggestion?
>>
>> Thanks a lot.
>>
>> Chieh-Yen Lin
>>
>
>


Re: RDD usage

2014-03-24 Thread hequn cheng
points.foreach(p=>p.y = another_value) will return a new modified RDD.


2014-03-24 18:13 GMT+08:00 Chieh-Yen :

> Dear all,
>
> I have a question about the usage of RDD.
> I implemented a class called AppDataPoint; it looks like:
>
> case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends
> Serializable {
>   var y : Double = input_y
>   var x : Array[Double] = input_x
>   ..
> }
> Furthermore, I created the RDD by the following function.
>
> def parsePoint(line: String): AppDataPoint = {
>   /* Some related works for parsing */
>   ..
> }
>
> Assume the RDD is called "points":
>
> val lines = sc.textFile(inputPath, numPartition)
> var points = lines.map(parsePoint _).cache()
>
> The question is that I tried to modify the values of this RDD; the
> operation is:
>
> points.foreach(p=>p.y = another_value)
>
> The operation works.
> There is no warning or error message from the system, and the results are
> right.
> I wonder whether modifying an RDD like this is a correct and actually
> workable design.
> The documentation says that RDDs are immutable; is there any suggestion?
>
> Thanks a lot.
>
> Chieh-Yen Lin
>


Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread hequn cheng
persist and unpersist.
unpersist: mark the RDD as non-persistent, and remove all blocks for it from
memory and disk.
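
A minimal sketch of that pattern; the input path and the chosen storage level
are placeholders, not something from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unpersist-sketch"))
    val rdd = sc.textFile("hdfs:///some/input").persist(StorageLevel.MEMORY_AND_DISK)
    val total = rdd.map(_.length).reduce(_ + _)   // stages that still need the RDD
    val lines = rdd.count()                       // reuses the persisted blocks
    rdd.unpersist()   // no later stage needs it: drop its blocks from memory and disk
    println(s"$lines lines, $total characters")
    sc.stop()
  }
}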


2014-03-19 16:40 GMT+08:00 林武康 :

>  Hi, can anyone tell me about the lifecycle of an RDD? I searched through
> the official website and still can't figure it out. Can I use an RDD in
> some stages and then destroy it to release memory, because no later stages
> will use this RDD any more? Is it possible?
>
> Thanks!
>
> Sincerely
> Lin wukang
>


How to set task number in a container

2014-03-11 Thread hequn cheng
When I increase my input data size, the executor fails and is lost.
See below:

14/03/11 20:44:18 INFO AppClient$ClientActor: Executor updated:
app-20140311204343-0008/8 is now FAILED (Command exited with code 134)
14/03/11 20:44:18 INFO SparkDeploySchedulerBackend: Executor
app-20140311204343-0008/8 removed: Command exited with code 134
14/03/11 20:44:18 INFO SparkDeploySchedulerBackend: Executor 8
disconnected, so removing it
14/03/11 20:44:18 ERROR TaskSchedulerImpl: Lost executor 8 on
Salve10.Hadoop: Unknown executor exit code (134) (died from signal 6?)
14/03/11 20:44:18 INFO TaskSetManager: Re-queueing tasks for 8 from TaskSet
1.0
14/03/11 20:44:18 WARN TaskSetManager: Lost TID 8 (task 1.0:22)
14/03/11 20:44:18 WARN TaskSetManager: Lost TID 49 (task 1.0:49)
14/03/11 20:44:18 WARN TaskSetManager: Lost TID 31 (task 1.0:40)
14/03/11 20:44:18 WARN TaskSetManager: Lost TID 1 (task 1.0:10)
14/03/11 20:44:18 WARN TaskSetManager: Lost TID 15 (task 1.0:48)
14/03/11 20:44:18 INFO AppClient$ClientActor: Executor added:
app-20140311204343-0008/11 on worker-20140311172513-Salve10.Hadoop-56435
(Salve10.Hadoop:56435) with 8 cores
14/03/11 20:44:18 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140311204343-0008/11 on hostPort Salve10.Hadoop:56435 with 8 cores,
4.0 GB RAM
14/03/11 20:44:18 INFO AppClient$ClientActor: Executor updated:
app-20140311204343-0008/11 is now RUNNING

It seems that too many tasks are running in one container and the total
task memory exceeds "spark.executor.memory".

So how do I set the task number in an executor?
Thank you.
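
Not an answer from this thread, just a sketch of the knobs that usually
control this: an executor runs roughly (its cores) / spark.task.cpus tasks at
once, so raising spark.task.cpus or lowering the cores given to the app
reduces per-executor concurrency. The property names and values below are
assumptions; check the configuration page for your Spark release.

import org.apache.spark.{SparkConf, SparkContext}

object TaskSlotsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("task-slots-sketch")
      .set("spark.executor.memory", "4g")  // memory per executor
      .set("spark.cores.max", "16")        // standalone mode: total cores granted to this app
      .set("spark.task.cpus", "2")         // cores reserved per task -> fewer concurrent tasks
    val sc = new SparkContext(conf)
    // ... the job ...
    sc.stop()
  }
}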


Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-10 Thread hequn cheng
Have you sent spark-env.sh to the slave nodes?
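
With SPARK_JAVA_OPTS the spark-env.sh file has to be present on every worker.
If a SparkConf-based route is preferred, later Spark releases (1.0 and up, so
newer than this thread) expose spark.executor.extraJavaOptions; a minimal
sketch, with the -Xss value chosen arbitrarily:

import org.apache.spark.{SparkConf, SparkContext}

object StackSizeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("stack-size-sketch")
      // Extra JVM options for every executor (Spark 1.0+); driver options are set separately.
      .set("spark.executor.extraJavaOptions", "-Xss8m")
    val sc = new SparkContext(conf)
    // ...
    sc.stop()
  }
}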


2014-03-11 6:47 GMT+08:00 Linlin :

>
> Hi,
>
> I have a Java option (-Xss) specified in SPARK_JAVA_OPTS in spark-env.sh.
> I noticed that after stopping/restarting the Spark cluster, the
> master/worker daemons have the setting applied, but the setting is not
> propagated to the executors, and my application continues to behave the
> same. I am not sure if there is a way to specify it through SparkConf, like
> SparkConf.set(), and what the correct way is to set this up for a
> particular Spark application.
>
> Thank you!
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-JAVA-OPTS-not-picked-up-by-the-application-tp2483.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


subscribe

2014-03-10 Thread hequn cheng
hi

