Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread
Got it.
I will also set "spark.dynamicAllocation.enabled=true" to test.
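For reference, enabling dynamic allocation on YARN also requires the external shuffle service; a spark-submit invocation along these lines (the class name comes from the mails below, the jar name and executor bounds are placeholders) would exercise it:

```shell
# Dynamic allocation lets Spark grow and shrink the executor pool at
# runtime; on YARN it needs the external shuffle service enabled.
./spark-submit \
  --master yarn --deploy-mode client \
  --class com.test.BeamTest \
  --driver-memory 1g --executor-memory 1g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  beam-test.jar --runner=SparkRunner
```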


From: Tim Robertson <timrobertson...@gmail.com>
Date: 2018-09-19 17:04
To: dev@beam.apache.org
CC: j...@nanthrax.net
Subject: Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thank you Devin

Can you also please try Beam with more spark executors if you are able?
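Something along these lines would give the Beam job more executors and higher shuffle parallelism than the single-executor runs reported below (jar name and exact counts are placeholders):

```shell
# Same job as before, but with more executors and a higher default
# parallelism so Beam's shuffle stages can actually run in parallel.
./spark-submit \
  --master yarn --deploy-mode client \
  --class com.test.BeamTest \
  --driver-memory 1g --executor-memory 1g \
  --num-executors 4 --executor-cores 2 \
  --conf spark.default.parallelism=32 \
  beam-test.jar --runner=SparkRunner
```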

On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞)
<devind...@tencent.com> wrote:
Thanks for your help!
I will test other examples of Beam on Spark and report back the results.
Regards
devin

From: Jean-Baptiste Onofré <j...@nanthrax.net>
Date: 2018-09-19 16:32
To: devinduan(段丁瑞) <devind...@tencent.com>; dev@beam.apache.org
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)

Thanks for the details.

I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).

Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I tested a 300 MB data file.
> I used a command like:
> ./spark-submit --master yarn --deploy-mode client  --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
>  I set only one executor, so tasks run in sequence. One Beam task costs 10s,
> while a Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> Did you compare the stages in the Spark UI to identify which
> stage is taking time?
>
> Did you use spark-submit in both cases for the bootstrapping?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm coding some examples with the Spark API and the Beam API, like
> > "WordCount", "Join", "OrderBy", "Union"...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is My Beam code and Spark code:
> >Beam "WordCount":
> >
> >Spark "WordCount":
> >
> >I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the Spark runner still uses
> > RDDs, whereas when using Spark directly you are using Datasets. A bunch
> > of optimizations in Spark are tied to Datasets.
> >
> > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > (and Datasets).
> > It's not yet ready as it includes other improvements (the portability
> > layer with the Job API, a first check of the State API, ...).
> >
> > Anyway, by Spark wordcount, you mean the one included in the spark
> > distribution ?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I used the Spark example WordCount to process a 1 GB data file;
> > > it took 1 minute.
> > > However, the Beam example WordCount on the same file took 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, my Beam version is 2.5.
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >

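The "Beam WordCount" code referenced in the thread was attached as an image and did not survive the archive; a minimal Beam WordCount in the spirit of the Beam examples (paths and class name are hypothetical) looks like:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(TextIO.read().from("hdfs:///tmp/input.txt"))
        // Split each line into words.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via(line -> Arrays.asList(line.split("\\W+"))))
        .apply(Filter.by(word -> !word.isEmpty()))
        // Count occurrences of each word.
        .apply(Count.perElement())
        .apply(MapElements.into(TypeDescriptors.strings())
            .via(kv -> kv.getKey() + ": " + kv.getValue()))
        // Note: no withNumShards(1) here, so the runner chooses the
        // shard count and the final write stays parallel.
        .apply(TextIO.write().to("hdfs:///tmp/wordcount"));

    p.run().waitUntilFinish();
  }
}
```

Submitted via spark-submit with --runner=SparkRunner, this is the shape of pipeline being benchmarked against the native Spark WordCount example.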








How to implement repartition.

2018-09-11 Thread
Hi all,
I recently started studying the Beam Spark runner.
I want to implement a repartition method similar to Spark's rdd.repartition(),
but I can't find a solution.
Could anyone help me?
Thanks for your reply.
devin.
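For what it's worth, the closest Beam analogue to rdd.repartition() is a shuffle-forcing transform such as Reshuffle; a minimal sketch (class name and input data are hypothetical) is:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class ReshuffleExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> input = p.apply(Create.of("a", "b", "c"));

    // Reshuffle.viaRandomKey() assigns random keys and groups by them,
    // forcing a full shuffle -- the Beam analogue of rdd.repartition().
    // Unlike repartition(n), the resulting partition count is chosen by
    // the runner (on Spark, effectively spark.default.parallelism),
    // not by the transform itself.
    PCollection<String> redistributed = input.apply(Reshuffle.viaRandomKey());

    p.run().waitUntilFinish();
  }
}
```

Beam deliberately leaves partitioning to the runner, so there is no direct way to request an exact partition count from the pipeline.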