persistence iops and throughput check? Re: Running a spark code on multiple machines using google cloud platform
Dear Anahita,

When we run performance tests for Spark/YARN clusters on GCP, we have to make sure we stay within the IOPS and throughput limits. Depending on the disk type (standard or SSD) and the size of the disk, you only get so much sustained IOPS and throughput per second. The GCP instance metrics graphs are not great, but they are enough to determine whether you are over the limit.

https://cloud.google.com/compute/docs/disks/performance

Heji

On Thu, Feb 2, 2017 at 4:29 AM, Anahita Talebi wrote:
> Dear all,
>
> I am trying to run a Spark code on multiple machines using submit job in
> Google Cloud Platform. As the inputs of my code, I have training and
> testing datasets.
>
> When I use a small training data set (around 10 KB), the code runs
> successfully on Google Cloud, but when I have a large data set (around
> 50 GB), I receive the following error:
>
> 17/02/01 19:08:06 ERROR org.apache.spark.scheduler.LiveListenerBus:
> SparkListenerBus has already stopped! Dropping event
> SparkListenerTaskEnd(2,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@3101f3b3,null)
>
> Can anyone give me a hint how I can solve my problem?
>
> PS: I cannot use a small training data set because I have an optimization
> code which needs to use all the data. I have to use Google Cloud Platform
> because I need to run the code on multiple machines.
>
> Thanks a lot,
> Anahita
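Since per-disk limits on GCP scale with disk size, a quick back-of-the-envelope check is easy to script. The per-GB multipliers below are assumptions for illustration only; the linked performance page is the authority for current figures:

```python
# Rough sketch: estimate sustained per-disk limits from disk size.
# The per-GB multipliers below are ASSUMPTIONS for illustration; check
# https://cloud.google.com/compute/docs/disks/performance for the
# current published figures before relying on any of these numbers.

DISK_PROFILES = {
    # disk type -> (read IOPS/GB, write IOPS/GB, read MB/s per GB, write MB/s per GB)
    "pd-standard": (0.75, 1.5, 0.12, 0.12),
    "pd-ssd": (30.0, 30.0, 0.48, 0.48),
}

def estimated_limits(disk_type: str, size_gb: int) -> dict:
    """Return estimated sustained IOPS/throughput for one persistent disk."""
    r_iops, w_iops, r_tp, w_tp = DISK_PROFILES[disk_type]
    return {
        "read_iops": size_gb * r_iops,
        "write_iops": size_gb * w_iops,
        "read_mb_s": size_gb * r_tp,
        "write_mb_s": size_gb * w_tp,
    }

if __name__ == "__main__":
    # For example, a 500 GB standard persistent disk:
    print(estimated_limits("pd-standard", 500))
```

Comparing these estimates against the observed disk metrics in the instance graphs tells you whether the shuffle/spill load of a 50 GB job is saturating the disk.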
Re: Running a spark code on multiple machines using google cloud platform
Thanks for your answer. Do you mean Amazon EMR?

On Thu, Feb 2, 2017 at 2:30 PM, Marco Mistroni wrote:
> U can use EMR if u want to run on a cluster
> Kr
Re: Running a spark code on multiple machines using google cloud platform
U can use EMR if u want to run on a cluster.
Kr

On 2 Feb 2017 12:30 pm, "Anahita Talebi" wrote:
Running a spark code on multiple machines using google cloud platform
Dear all,

I am trying to run a Spark code on multiple machines using submit job in Google Cloud Platform. As the inputs of my code, I have training and testing datasets.

When I use a small training data set (around 10 KB), the code runs successfully on Google Cloud, but when I have a large data set (around 50 GB), I receive the following error:

17/02/01 19:08:06 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(2,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@3101f3b3,null)

Can anyone give me a hint how I can solve my problem?

PS: I cannot use a small training data set because I have an optimization code which needs to use all the data. I have to use Google Cloud Platform because I need to run the code on multiple machines.

Thanks a lot,
Anahita
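Assuming the job is submitted through Cloud Dataproc's job-submission command, one common first step when a job that works on small input dies at tens of GB is to give the executors more memory and partitions at submit time. A sketch (the cluster name, class, jar path, bucket paths, and property values are all hypothetical placeholders, not taken from the thread):

```shell
# Sketch only: cluster name, main class, jar and bucket paths, and the
# property values are hypothetical placeholders -- adjust to your own job.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --class=com.example.MyOptimizer \
  --jars=gs://my-bucket/my-job.jar \
  --properties=spark.executor.memory=8g,spark.executor.cores=4,spark.default.parallelism=400 \
  -- gs://my-bucket/training-data gs://my-bucket/testing-data
```

The "SparkListenerBus has already stopped" message is usually a symptom rather than the root cause: the context is shutting down because something upstream (often an executor killed for exceeding memory) failed, so the YARN and executor logs are the place to look for the real error.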