Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Tim Robertson
Thanks for sharing those results.

The second set of results (executors at 20-30) looks similar to what I
would have expected.
BEAM-5036 definitely plays a part here, as the data is not moved
efficiently on HDFS (a fix is in a PR awaiting review now [1]).

To give an idea of the impact, here are some numbers from my own tests.
Without knowing your code, I presume mine is similar to your filter test
(take data, modify it, write data, with no shuffle/group/join).

My environment: a 10-node YARN CDH 5.12.2 cluster, rewriting a 1.5TB Avro
file (code here [2]; a sketch of the pipeline shape follows the numbers
below). I observed:

  - Using Spark API: 35 minutes
  - Beam AvroIO (2.6.0): 1.7hrs
  - Beam AvroIO with the 5036 fix: 42 minutes
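
For reference, the rewrite in [2] is essentially this shape of pipeline -
a minimal sketch with a placeholder schema and paths, not the actual GBIF
code:

// Minimal sketch of a shuffle-free Avro rewrite (read -> modify -> write).
// The schema and the hdfs:// paths below are placeholders.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class AvroRewrite {
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
          + "[{\"name\":\"value\",\"type\":\"string\"}]}");

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(AvroIO.readGenericRecords(SCHEMA).from("hdfs:///tmp/in/*.avro"))
        .apply(ParDo.of(new DoFn<GenericRecord, GenericRecord>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Modify the record here; note there is no shuffle/group/join.
            c.output(c.element());
          }
        }))
        .apply(AvroIO.writeGenericRecords(SCHEMA).to("hdfs:///tmp/out/part"));
    p.run().waitUntilFinish();
  }
}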

Related: I also anticipate that varying spark.default.parallelism will
affect the Beam runtime.

Thanks,
Tim


[1] https://github.com/apache/beam/pull/6289
[2] https://github.com/gbif/beam-perf/tree/master/avro-to-avro


On Fri, Sep 28, 2018 at 9:27 AM Robert Bradshaw  wrote:

> Something here on the Beam side is clearly linear in the input size, as if
> there's a bottleneck where we're not able to get any parallelization. Is
> the spark variant running in parallel?
>
> On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) 
> wrote:
>
>> Hi
>> I have completed my test.
>> 1. Spark parameter :
>> deploy-mode client
>> executor-memory 1g
>> num-executors 1
>> driver-memory 1g
>>
>> WordCount:
>>
>>             300MB     600MB     1.2G
>> Spark       1min8s    1min11s   1min18s
>> Beam        6.4min    11min     22min
>>
>> Filter:
>>
>>             300MB     600MB     1.2G
>> Spark       1.2min    1.7min    2.8min
>> Beam        2.7min    4.1min    5.7min
>>
>> GroupbyKey + sum:
>>
>>             300MB     600MB     1.2G
>> Spark       3.6min
>> Beam        Failed, executor oom
>>
>> Union:
>>
>>             300MB     600MB     1.2G
>> Spark       1.7min    2.6min    5.1min
>> Beam        3.6min    6.2min    11min
>>
>> 2. Spark parameter :
>> deploy-mode client
>> executor-memory 1g
>> driver-memory 1g
>> spark.dynamicAllocation.enabled true
>>
>


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Robert Bradshaw
Something here on the Beam side is clearly linear in the input size, as if
there's a bottleneck where we're not able to get any parallelization. Is
the spark variant running in parallel?

On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) 
wrote:

> Hi
> I have completed my test.
> 1. Spark parameter :
> deploy-mode client
> executor-memory 1g
> num-executors 1
> driver-memory 1g
>
> WordCount:
>
>            300MB     600MB     1.2G
> Spark      1min8s    1min11s   1min18s
> Beam       6.4min    11min     22min
>
> Filter:
>
>            300MB     600MB     1.2G
> Spark      1.2min    1.7min    2.8min
> Beam       2.7min    4.1min    5.7min
>
> GroupbyKey + sum:
>
>            300MB     600MB     1.2G
> Spark      3.6min
> Beam       Failed, executor oom
>
> Union:
>
>            300MB     600MB     1.2G
> Spark      1.7min    2.6min    5.1min
> Beam       3.6min    6.2min    11min
>
> 2. Spark parameter :
> deploy-mode client
> executor-memory 1g
> driver-memory 1g
> spark.dynamicAllocation.enabled true
>


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread 段丁瑞
Got it.
I will also set "spark.dynamicAllocation.enabled=true" to test.


From: Tim Robertson
Date: 2018-09-19 17:04
To: dev@beam.apache.org
CC: j...@nanthrax.net
Subject: Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thank you Devin

Can you also please try Beam with more spark executors if you are able?

On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞)  wrote:
Thanks for your help!
I will test other examples of Beam on Spark in the future and then feed back
the results.
Regards
devin

From: Jean-Baptiste Onofré
Date: 2018-09-19 16:32
To: devinduan(段丁瑞); dev
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)

Thanks for the details.

I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).

Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I test 300MB data file.
> Use command like:
> ./spark-submit --master yarn --deploy-mode client  --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
>  I set only one executor, so tasks run in sequence. One task costs 10s.
> However, a Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org 
> 
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm coding some examples with the Spark API and Beam API, like
> > "WordCount", "Join", "OrderBy", "Union" ...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is my Beam code and Spark code:
> >Beam "WordCount":
> >
> >Spark "WordCount":
> >
> >I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré 
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org 
> 
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the Spark runner still uses
> > RDDs, whereas when using Spark directly you are using Datasets. A bunch
> > of optimizations in Spark are related to Datasets.
> >
> > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > (and Datasets).
> > It's not yet ready, as it includes other improvements (the portability
> > layer with the Job API, a first check of the state API, ...).
> >
> > Anyway, by Spark WordCount, you mean the one included in the Spark
> > distribution?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use the Spark example code WordCount to process a 1G data file; it
> > > costs 1 minute.
> > > However, the Beam example code WordCount processing the same file
> > > costs 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, Beam version is 2.5.
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread Tim Robertson
Thank you Devin

Can you also please try Beam with more spark executors if you are able?

On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) 
wrote:

> Thanks for your help!
> I will test other examples of Beam on Spark in the future and then feed
> back the results.
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-19 16:32
> *To:* devinduan(段丁瑞) ; dev 
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet
> mail)
>
> Thanks for the details.
>
> I will take a look later tomorrow (I have another issue to investigate
> on the Spark runner today for Beam 2.7.0 release).
>
> Regards
> JB
>
> On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> > Hi,
> > I test 300MB data file.
> > Use command like:
> > ./spark-submit --master yarn --deploy-mode client  --class
> > com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory
> 1g
> >
> >  I set only one executor, so tasks run in sequence. One task costs 10s.
> > However, a Spark task costs only 0.4s.
> >
> >
> >
> > *From:* Jean-Baptiste Onofré  >
> > *Date:* 2018-09-19 12:22
> > *To:* dev@beam.apache.org  >
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > did you compare the stages in the Spark UI in order to identify which
> > stage is taking time ?
> >
> > You use spark-submit in both cases for the bootstrapping ?
> >
> > I will do a test here as well.
> >
> > Regards
> > JB
> >
> > On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > > Hi,
> > > Thanks for your reply.
> > > Our team plans to use Beam instead of Spark, so I'm testing the
> > > performance of the Beam API.
> > > I'm coding some examples with the Spark API and Beam API, like
> > > "WordCount", "Join", "OrderBy", "Union" ...
> > > I use the same resources and configuration to run these jobs.
> > > Tim said I should remove "withNumShards(1)" and
> > > set spark.default.parallelism=32. I did it and tried again, but the
> > > Beam job still runs very slowly.
> > > Here is my Beam code and Spark code:
> > >Beam "WordCount":
> > >
> > >Spark "WordCount":
> > >
> > >I will try the other example later.
> > >
> > > Regards
> > > devin
> > >
> > >
> > > *From:* Jean-Baptiste Onofré  >
> > > *Date:* 2018-09-18 22:43
> > > *To:* dev@beam.apache.org  >
> > > *Subject:* Re: How to optimize the performance of Beam on
> > > Spark(Internet mail)
> > >
> > > Hi,
> > >
> > > The first huge difference is the fact that the Spark runner still uses
> > > RDDs, whereas when using Spark directly you are using Datasets. A bunch
> > > of optimizations in Spark are related to Datasets.
> > >
> > > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > > (and Datasets).
> > > It's not yet ready, as it includes other improvements (the portability
> > > layer with the Job API, a first check of the state API, ...).
> > >
> > > Anyway, by Spark WordCount, you mean the one included in the Spark
> > > distribution?
> > >
> > > Regards
> > > JB
> > >
> > > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > > Hi,
> > > > I'm testing Beam on Spark.
> > > > I use the Spark example code WordCount to process a 1G data file;
> > > > it costs 1 minute.
> > > > However, the Beam example code WordCount processing the same file
> > > > costs 30 minutes.
> > > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > > --num-executors 1 --driver-memory 1g
> > > > My Spark version is 2.3.1, Beam version is 2.5.
> > > > Is there any optimization method?
> > > > Thank you.
> > > >
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread 段丁瑞
Thanks for your help!
I will test other examples of Beam on Spark in the future and then feed back
the results.
Regards
devin

From: Jean-Baptiste Onofré
Date: 2018-09-19 16:32
To: devinduan(段丁瑞); dev
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)

Thanks for the details.

I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).

Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I test 300MB data file.
> Use command like:
> ./spark-submit --master yarn --deploy-mode client  --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
>  I set only one executor, so tasks run in sequence. One task costs 10s.
> However, a Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org 
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm coding some examples with the Spark API and Beam API, like
> > "WordCount", "Join", "OrderBy", "Union" ...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is my Beam code and Spark code:
> >Beam "WordCount":
> >
> >Spark "WordCount":
> >
> >I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré 
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org 
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the Spark runner still uses
> > RDDs, whereas when using Spark directly you are using Datasets. A bunch
> > of optimizations in Spark are related to Datasets.
> >
> > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > (and Datasets).
> > It's not yet ready, as it includes other improvements (the portability
> > layer with the Job API, a first check of the state API, ...).
> >
> > Anyway, by Spark WordCount, you mean the one included in the Spark
> > distribution?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use the Spark example code WordCount to process a 1G data file; it
> > > costs 1 minute.
> > > However, the Beam example code WordCount processing the same file
> > > costs 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, Beam version is 2.5.
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread Jean-Baptiste Onofré
Thanks for the details.

I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).

Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
>     I test 300MB data file.
>     Use command like:
>     ./spark-submit --master yarn --deploy-mode client  --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g 
>    
>  I set only one executor, so tasks run in sequence. One task costs 10s.
> However, a Spark task costs only 0.4s.
> 
> 
> 
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org 
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
> 
> Hi,
> 
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
> 
> You use spark-submit in both cases for the bootstrapping ?
> 
> I will do a test here as well.
> 
> Regards
> JB
> 
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> >     Thanks for your reply.
> >     Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> >     I'm coding some examples with the Spark API and Beam API, like
> > "WordCount", "Join", "OrderBy", "Union" ...
> >     I use the same resources and configuration to run these jobs.
> >     Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> >     Here is my Beam code and Spark code:
> >    Beam "WordCount":
> >     
> >    Spark "WordCount":
> >
> >    I will try the other example later.
> >     
> > Regards
> > devin
> >
> >  
> > *From:* Jean-Baptiste Onofré 
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org 
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the Spark runner still uses
> > RDDs, whereas when using Spark directly you are using Datasets. A bunch
> > of optimizations in Spark are related to Datasets.
> >
> > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > (and Datasets).
> > It's not yet ready, as it includes other improvements (the portability
> > layer with the Job API, a first check of the state API, ...).
> >
> > Anyway, by Spark WordCount, you mean the one included in the Spark
> > distribution?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > >     I'm testing Beam on Spark. 
> > >     I use the Spark example code WordCount to process a 1G data file;
> > > it costs 1 minute.
> > >     However, the Beam example code WordCount processing the same file
> > > costs 30 minutes.
> > >     My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > >     My Spark version is 2.3.1, Beam version is 2.5.
> > >     Is there any optimization method?
> > > Thank you.
> > >
> > >    
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-18 Thread Jean-Baptiste Onofré
Hi,

did you compare the stages in the Spark UI in order to identify which
stage is taking time ?

You use spark-submit in both cases for the bootstrapping ?

I will do a test here as well.

Regards
JB

On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> Hi,
>     Thanks for your reply.
>     Our team plans to use Beam instead of Spark, so I'm testing the
> performance of the Beam API.
>     I'm coding some examples with the Spark API and Beam API, like
> "WordCount", "Join", "OrderBy", "Union" ...
>     I use the same resources and configuration to run these jobs.
>     Tim said I should remove "withNumShards(1)" and
> set spark.default.parallelism=32. I did it and tried again, but the Beam
> job still runs very slowly.
>     Here is my Beam code and Spark code:
>    Beam "WordCount":
>     
>    Spark "WordCount":
> 
>    I will try the other example later.
>     
> Regards
> devin
> 
>  
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-18 22:43
> *To:* dev@beam.apache.org 
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
> 
> Hi,
> 
> The first huge difference is the fact that the Spark runner still uses
> RDDs, whereas when using Spark directly you are using Datasets. A bunch of
> optimizations in Spark are related to Datasets.
> 
> I started a large refactoring of the Spark runner to leverage Spark 2.x
> (and Datasets).
> It's not yet ready, as it includes other improvements (the portability
> layer with the Job API, a first check of the state API, ...).
> 
> Anyway, by Spark WordCount, you mean the one included in the Spark
> distribution?
> 
> Regards
> JB
> 
> On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > Hi,
> >     I'm testing Beam on Spark. 
> >     I use the Spark example code WordCount to process a 1G data file;
> > it costs 1 minute.
> >     However, the Beam example code WordCount processing the same file
> > costs 30 minutes.
> >     My Spark parameters are: --deploy-mode client --executor-memory 1g
> > --num-executors 1 --driver-memory 1g
> >     My Spark version is 2.3.1, Beam version is 2.5.
> >     Is there any optimization method?
> > Thank you.
> >
> >    
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
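
As an aside on the "withNumShards(1)" advice quoted above: pinning the
shard count to one funnels all output through a single writer, so the
final write stage runs with parallelism 1. A minimal sketch of the two
write configurations (TextIO for illustration; paths are hypothetical):

// Sketch of the withNumShards(1) pitfall; paths are hypothetical.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ShardingSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Forces every record through one writer: a convenient single output
    // file, but the final write stage runs with parallelism 1.
    p.apply("ReadA", TextIO.read().from("hdfs:///tmp/in/*"))
        .apply("SingleShard",
            TextIO.write().to("hdfs:///tmp/out-one").withNumShards(1));

    // Runner-determined sharding: writers run in parallel, one file per shard.
    p.apply("ReadB", TextIO.read().from("hdfs:///tmp/in/*"))
        .apply("RunnerSharded", TextIO.write().to("hdfs:///tmp/out-many"));

    p.run().waitUntilFinish();
  }
}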


Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Jean-Baptiste Onofré
Hi,

The first huge difference is the fact that the Spark runner still uses
RDDs, whereas when using Spark directly you are using Datasets. A bunch of
optimizations in Spark are related to Datasets.
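
To make the contrast concrete, a Dataset-flavoured Spark word count looks
roughly like this (a sketch with placeholder paths, not the exact example
shipped with Spark):

// Sketch of a Dataset-based Spark word count - the style contrasted here
// with the RDD translation the Beam Spark runner uses. Paths are placeholders.
import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class DatasetWordCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("WordCount").getOrCreate();
    Dataset<String> words = spark.read().textFile("hdfs:///tmp/input/*")
        .flatMap(
            (FlatMapFunction<String, String>)
                line -> Arrays.asList(line.split("\\s+")).iterator(),
            Encoders.STRING());
    // groupBy on the Dataset's "value" column goes through Catalyst/Tungsten,
    // optimizations the RDD API (and hence the current Spark runner) misses.
    words.groupBy("value").count().write().csv("hdfs:///tmp/wordcount-spark");
    spark.stop();
  }
}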

I started a large refactoring of the spark runner to leverage Spark 2.x
(and dataset).
It's not yet ready as it includes other improvements (the portability
layer with Job API, a first check of state API, ...).

Anyway, by Spark WordCount, you mean the one included in the Spark
distribution?

Regards
JB

On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> Hi,
>     I'm testing Beam on Spark. 
>     I use the Spark example code WordCount to process a 1G data file; it
> costs 1 minute.
>     However, the Beam example code WordCount processing the same file
> costs 30 minutes.
>     My Spark parameters are: --deploy-mode client --executor-memory 1g
> --num-executors 1 --driver-memory 1g
>     My Spark version is 2.3.1, Beam version is 2.5.
>     Is there any optimization method?
> Thank you.
> 
>    

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Tim Robertson
Hi devinduan

The known issues Robert links there are actually HDFS related and not
specific to Spark.  The improvement we're seeking is that the final copy of
the output file can be optimised by using a "move" instead of a "copy" (the
gist is sketched below), and I expect to have it fixed for Beam 2.8.0.  On
a small dataset like this, though, I don't think it will impact performance
too much.
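
The gist of that improvement, sketched with Beam's FileSystems API - this
is illustrative only, not the actual patch, and the paths are hypothetical.
On HDFS a rename is a metadata-only operation, while a copy rereads and
rewrites every byte of output:

// Illustrative sketch of "move instead of copy" when finalizing output files.
// Not the actual BEAM-5036 patch; paths are hypothetical. Running against
// hdfs:// also needs beam-sdks-java-io-hadoop-file-system on the classpath.
import java.util.Collections;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MoveOptions.StandardMoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;

public class FinalizeSketch {
  public static void main(String[] args) throws Exception {
    ResourceId temp = FileSystems.matchNewResource("hdfs:///tmp/.temp/part-00000", false);
    ResourceId dest = FileSystems.matchNewResource("hdfs:///tmp/out/part-00000", false);

    // Slow path: a byte-for-byte copy of the temp file, followed by a delete.
    // FileSystems.copy(Collections.singletonList(temp), Collections.singletonList(dest));

    // Fast path on HDFS: a metadata-only move.
    FileSystems.rename(
        Collections.singletonList(temp),
        Collections.singletonList(dest),
        StandardMoveOptions.IGNORE_MISSING_FILES);
  }
}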

Can you please elaborate on your deployment?  It looks like you are using a
cluster (i.e. deploy-mode client) but are you using HDFS?

I have access to a Cloudera CDH 5.12 Hadoop cluster and just ran an example
word count as follows - I'll explain the parameters to tune below:

1) I generate some random data (using common Hadoop tools)
hadoop jar
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
  teragen \
  -Dmapred.map.tasks=100 \
  -Dmapred.map.tasks.speculative.execution=false \
  1000  \
  /tmp/tera

This puts 100 files totalling just under 1GB on which I will run the word
count. They are stored in the HDFS filesystem.

2) Run the word count using Spark (2.3.x) and Beam 2.5.0

In my cluster I have YARN to allocate resources, and an HDFS filesystem.
This will be different if you run Spark as standalone, or on a cloud
environment.

spark2-submit \
  --conf spark.default.parallelism=45 \
  --class org.apache.beam.runners.spark.examples.WordCount \
  --master yarn \
  --executor-memory 2G \
  --executor-cores 5 \
  --num-executors 9 \
  --jars
beam-sdks-java-core-2.5.0.jar,beam-runners-core-construction-java-2.5.0.jar,beam-runners-core-java-2.5.0.jar,beam-sdks-java-io-hadoop-file-system-2.5.0.jar
\
  beam-runners-spark-2.5.0.jar \
  --runner=SparkRunner \
  --inputFile=hdfs:///tmp/tera/* \
  --output=hdfs:///tmp/wordcount

The jars I provide here are the minimum needed for running on HDFS with
Spark and normally you'd build those into your project as an über jar.

The important bits for tuning for performance are the following - these
will be applicable for any Spark deployment (unless embedded):

  spark.default.parallelism - controls the parallelism of the beam
pipeline. In this case, how many workers are tokenizing the input data.
  executor-memory, executor-cores, num-executors - controls the resources
spark will use

Note that a parallelism of 45 means the 5 cores in each of the 9 executors
can all run concurrently (i.e. 5 x 9 = 45). When you get to very large
datasets, you will likely use a much higher parallelism.

In this test I see around 20 seconds of initial Spark startup (copying
jars, requesting resources from YARN, establishing the Spark context), but
once up, the job completes in a few seconds, writing the output into 45
files (because of the parallelism). The files are named
/tmp/wordcount-000*-of-00045.

I hope this helps provide a few pointers, but if you elaborate on your
environment we might be able to assist more.

Best wishes,
Tim

On Tue, Sep 18, 2018 at 9:29 AM Robert Bradshaw  wrote:

> There are known performance issues with Beam on Spark that are being
> worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's
> possible you're hitting something different, but would be worth
> investigating. See also
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performance%20of%20write
>
> On Tue, Sep 18, 2018 at 8:39 AM devinduan(段丁瑞) 
> wrote:
>
>> Hi,
>> I'm testing Beam on Spark.
>> I use the Spark example code WordCount to process a 1G data file; it
>> costs 1 minute.
>> However, the Beam example code WordCount processing the same file costs
>> 30 minutes.
>> My Spark parameters are: --deploy-mode client --executor-memory 1g
>> --num-executors 1 --driver-memory 1g
>> My Spark version is 2.3.1, Beam version is 2.5.
>> Is there any optimization method?
>> Thank you.
>>
>>
>>
>


Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Robert Bradshaw
There are known performance issues with Beam on Spark that are being worked
on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's possible
you're hitting something different, but would be worth investigating. See
also
https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performance%20of%20write

On Tue, Sep 18, 2018 at 8:39 AM devinduan(段丁瑞) 
wrote:

> Hi,
> I'm testing Beam on Spark.
> I use the Spark example code WordCount to process a 1G data file; it
> costs 1 minute.
> However, the Beam example code WordCount processing the same file costs
> 30 minutes.
> My Spark parameters are: --deploy-mode client --executor-memory 1g
> --num-executors 1 --driver-memory 1g
> My Spark version is 2.3.1, Beam version is 2.5.
> Is there any optimization method?
> Thank you.
>
>
>


How to optimize the performance of Beam on Spark

2018-09-18 Thread 段丁瑞
Hi,
I'm testing Beam on Spark.
I use the Spark example code WordCount to process a 1G data file; it costs
1 minute.
However, the Beam example code WordCount processing the same file costs 30
minutes.
My Spark parameters are: --deploy-mode client --executor-memory 1g
--num-executors 1 --driver-memory 1g
My Spark version is 2.3.1, Beam version is 2.5.
Is there any optimization method?
Thank you.
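
For reference, the Beam example WordCount under discussion is roughly this
shape - a minimal sketch of the stock example, not a verbatim copy, with
placeholder paths:

// Minimal sketch of a Beam WordCount (Beam 2.x-era API); paths are placeholders.
// Pass --runner=SparkRunner (plus the Spark runner jars) to run it on Spark.
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("hdfs:///tmp/input/*"))
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Count.perElement())
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("hdfs:///tmp/wordcount"));
    p.run().waitUntilFinish();
  }
}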