persistence iops and throughput check? Re: Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Heji Kim
Dear Anahita,

When we run performance tests for Spark/YARN clusters on GCP, we have to
make sure we stay within the IOPS and throughput limits. Depending on the
disk type (standard or SSD) and the size of the disk, you only get a certain
maximum sustained IOPS and throughput per second. The GCP instance metrics
graphs are not great, but they are enough to determine whether you are over
the limit.

https://cloud.google.com/compute/docs/disks/performance

Heji
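
The check described above can be sketched roughly as follows. This is a
minimal illustration, not GCP's own formula: it assumes the limits scale
linearly with disk size, and the names DiskLimits, limits_for_disk and
saturated_metrics, as well as the per-GB rates in the usage lines, are
hypothetical placeholders. Take the real rates and any per-instance caps
from the performance page linked above, and the observed values from the
instance metrics graphs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DiskLimits:
    """Sustained limits for one disk, as supplied or computed by the caller."""
    max_iops: float
    max_throughput_mb_s: float


def limits_for_disk(size_gb: float, iops_per_gb: float, mb_s_per_gb: float) -> DiskLimits:
    """Scale per-GB rates by disk size (assumes simple linear scaling, no caps)."""
    return DiskLimits(
        max_iops=size_gb * iops_per_gb,
        max_throughput_mb_s=size_gb * mb_s_per_gb,
    )


def saturated_metrics(
    observed_iops: float,
    observed_mb_s: float,
    limits: DiskLimits,
    threshold: float = 0.9,
) -> List[str]:
    """Return the names of metrics at or above threshold * limit."""
    flagged = []
    if observed_iops >= threshold * limits.max_iops:
        flagged.append("iops")
    if observed_mb_s >= threshold * limits.max_throughput_mb_s:
        flagged.append("throughput")
    return flagged


# Placeholder rates for illustration only; replace with the documented values.
example_limits = limits_for_disk(size_gb=500, iops_per_gb=2.0, mb_s_per_gb=0.5)
print(saturated_metrics(observed_iops=950, observed_mb_s=100, limits=example_limits))
```

An empty list means both observed values are comfortably below the limits;
a non-empty list suggests the disk may be the bottleneck for the job.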

On Thu, Feb 2, 2017 at 4:29 AM, Anahita Talebi 
wrote:

> Dear all,
>
> I am trying to run Spark code on multiple machines by submitting a job on
> Google Cloud Platform.
> The inputs to my code are a training dataset and a testing dataset.
>
> When I use a small training dataset (about 10 KB), the code runs
> successfully on Google Cloud, but when I use a large dataset
> (about 50 GB), I receive the following error:
>
> 17/02/01 19:08:06 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerTaskEnd(2,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@3101f3b3,null)
>
> Can anyone give me a hint on how to solve this problem?
>
> PS: I cannot use a small training dataset because my optimization code
> needs to use all the data.
>
> I have to use Google Cloud Platform because I need to run the code on
> multiple machines.
>
> Thanks a lot,
>
> Anahita
>
>


Re: Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Anahita Talebi
Thanks for your answer.
Do you mean Amazon EMR?

On Thu, Feb 2, 2017 at 2:30 PM, Marco Mistroni  wrote:

> You can use EMR if you want to run on a cluster.
> Kind regards


Re: Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Marco Mistroni
You can use EMR if you want to run on a cluster.
Kind regards

On 2 Feb 2017 12:30 pm, "Anahita Talebi"  wrote:

> Dear all,
>
> I am trying to run Spark code on multiple machines by submitting a job on
> Google Cloud Platform.
> The inputs to my code are a training dataset and a testing dataset.
>
> When I use a small training dataset (about 10 KB), the code runs
> successfully on Google Cloud, but when I use a large dataset
> (about 50 GB), I receive the following error:
>
> 17/02/01 19:08:06 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerTaskEnd(2,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@3101f3b3,null)
>
> Can anyone give me a hint on how to solve this problem?
>
> PS: I cannot use a small training dataset because my optimization code
> needs to use all the data.
>
> I have to use Google Cloud Platform because I need to run the code on
> multiple machines.
>
> Thanks a lot,
>
> Anahita
>
>


Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Anahita Talebi
Dear all,

I am trying to run Spark code on multiple machines by submitting a job on
Google Cloud Platform.
The inputs to my code are a training dataset and a testing dataset.

When I use a small training dataset (about 10 KB), the code runs
successfully on Google Cloud, but when I use a large dataset
(about 50 GB), I receive the following error:

17/02/01 19:08:06 ERROR org.apache.spark.scheduler.LiveListenerBus:
SparkListenerBus has already stopped! Dropping event
SparkListenerTaskEnd(2,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@3101f3b3,null)

Can anyone give me a hint on how to solve this problem?

PS: I cannot use a small training dataset because my optimization
code needs to use all the data.

I have to use Google Cloud Platform because I need to run the code on
multiple machines.

Thanks a lot,

Anahita