Thanks Russell.

Yep, I meant SparkContext (or SparkSession in Spark2).
It's not only about the startup time of a Spark session (a 30-second delay
for each task is still a lot).

Another benefit: it would be super useful if we decide to cache() a
dataframe.
That could be a huge gain for other tasks that need the same (cached)
dataframe.

Without an option to share a Spark session, each DAG task has to
(1) restart the Spark context,
and (2) re-cache the dataframes needed in the workflow.
That would be a major slowdown for a Spark job.
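For illustration, here is a minimal sketch of the session-reuse pattern I
mean. This is plain Python, not actual Airflow or Spark code; the names
`SessionHolder` and `expensive_startup` are hypothetical stand-ins. In real
PySpark the analogous mechanism is `SparkSession.builder.getOrCreate()`,
which returns an existing session instead of paying the startup cost again:

```python
# Hypothetical sketch: two "tasks" in one process share a session,
# so the expensive startup happens once and cached work is reusable.

class SessionHolder:
    """Caches a single 'session' object, created lazily on first use."""
    _session = None

    @classmethod
    def get_or_create(cls, factory):
        if cls._session is None:
            cls._session = factory()  # expensive: paid once, not per task
        return cls._session

def expensive_startup():
    # Stand-in for creating a SparkSession (~30s on our cluster).
    return {"cached_dataframes": {}}

# Task 1: creates the session and caches a dataframe (like df.cache()).
s1 = SessionHolder.get_or_create(expensive_startup)
s1["cached_dataframes"]["events"] = [1, 2, 3]

# Task 2: gets the same session back; no restart, no re-caching.
s2 = SessionHolder.get_or_create(expensive_startup)
assert s2 is s1
assert "events" in s2["cached_dataframes"]
```

Without sharing, each task would call `expensive_startup()` itself and start
with an empty cache, which is exactly the slowdown described above.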

>> but an InteractiveSparkContext or SparkConsoleContext might be able to
do this?

I couldn't find InteractiveSparkContext or SparkConsoleContext in Airflow
or in Spark .. could you elaborate?

Thanks again.



-- 
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 7:43 PM, Russell Jurney <[email protected]>
wrote:

> Ruslan, thanks for your feedback.
>
> You mean the spark-submit context? Or like the SparkContext and
> SparkSession? I don't think we could keep that alive, because it wouldn't
> work out with multiple calls to spark-submit. I do feel your pain, though.
> Maybe someone else can see how this might be done?
>
> If SparkContext was able to open the spark/pyspark console, then multiple
> job submissions would be possible. I didn't have this in mind but an
> InteractiveSparkContext or SparkConsoleContext might be able to do this?
>
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> [email protected] LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
> On Sat, Mar 18, 2017 at 3:02 PM, Ruslan Dautkhanov <[email protected]>
> wrote:
>
> > +1 Great idea.
> >
> > my two cents - it would be nice (as an option) if SparkOperator would be
> > able to keep context open between different calls,
> > as it takes 30+ seconds to create a new context (on our cluster). Not
> sure
> > how well it fits Airflow architecture.
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> > On Sat, Mar 18, 2017 at 3:45 PM, Russell Jurney <
> [email protected]>
> > wrote:
> >
> > > What do people think about creating a SparkOperator that uses
> > spark-submit
> > > to submit jobs? Would work for Scala/Java Spark and PySpark. The
> patterns
> > > outlined in my presentation on Airflow and PySpark
> > > <http://bit.ly/airflow_pyspark> would fit well inside an Operator, I
> > > think.
> > > BashOperator works, but why not tailor something to spark-submit?
> > >
> > > I'm open to doing the work, but I wanted to see what people thought
> about
> > it
> > > and get feedback about things they would like to see in SparkOperator
> and
> > > get any pointers people had to doing the implementation.
> > >
> > > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > > [email protected] LI <http://linkedin.com/in/russelljurney> FB
> > > <http://facebook.com/jurney> datasyndrome.com
> > >
> >
>
