Thanks Bolke. That's awesome. 1) So each task would create its own Spark session? Is there a way to share a Spark session between tasks, like discussed in this email chain?
2) It looks like SparkSqlHook calls the `spark-sql` shell with all those parameters: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_sql_hook.py#L88. This probably will not work in Cloudera's distribution of Spark. I think they stopped shipping `spark-sql` as of CDH 5.4, either because CDH Spark doesn't have the Thrift service or for some other reason.

Thank you.

--
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 8:24 PM, Bolke de Bruin <[email protected]> wrote:

> A spark operator exists as of 1.8.0 (which will be released tomorrow), you
> might want to take a look at that. I know that an update is coming to that
> operator that improves communication with Yarn.
>
> Bolke
>
> > On 18 Mar 2017, at 18:43, Russell Jurney <[email protected]> wrote:
> >
> > Ruslan, thanks for your feedback.
> >
> > You mean the spark-submit context? Or like the SparkContext and
> > SparkSession? I don't think we could keep that alive, because it wouldn't
> > work out with multiple calls to spark-submit. I do feel your pain, though.
> > Maybe someone else can see how this might be done?
> >
> > If SparkContext were able to open the spark/pyspark console, then multiple
> > job submissions would be possible. I didn't have this in mind, but an
> > InteractiveSparkContext or SparkConsoleContext might be able to do this?
> >
> > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > [email protected] LI <http://linkedin.com/in/russelljurney> FB
> > <http://facebook.com/jurney> datasyndrome.com
> >
> > On Sat, Mar 18, 2017 at 3:02 PM, Ruslan Dautkhanov <[email protected]>
> > wrote:
> >
> >> +1 Great idea.
> >>
> >> My two cents: it would be nice (as an option) if SparkOperator were able
> >> to keep the context open between different calls, as it takes 30+ seconds
> >> to create a new context (on our cluster). Not sure how well that fits the
> >> Airflow architecture.
> >>
> >> --
> >> Ruslan Dautkhanov
> >>
> >> On Sat, Mar 18, 2017 at 3:45 PM, Russell Jurney <[email protected]>
> >> wrote:
> >>
> >>> What do people think about creating a SparkOperator that uses
> >>> spark-submit to submit jobs? It would work for Scala/Java Spark and
> >>> PySpark. The patterns outlined in my presentation on Airflow and PySpark
> >>> <http://bit.ly/airflow_pyspark> would fit well inside an Operator, I
> >>> think. BashOperator works, but why not tailor something to spark-submit?
> >>>
> >>> I'm open to doing the work, but I wanted to see what people thought
> >>> about it and get feedback about things they would like to see in
> >>> SparkOperator, and get any pointers people had to doing the
> >>> implementation.
> >>>
> >>> Russell Jurney @rjurney <http://twitter.com/rjurney>
> >>> [email protected] LI <http://linkedin.com/in/russelljurney> FB
> >>> <http://facebook.com/jurney> datasyndrome.com
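For reference, the "wrap spark-submit in an operator" pattern the thread is debating can be sketched roughly as below. This is a minimal illustration only, not the actual SparkSubmitOperator/SparkSubmitHook code that shipped in Airflow 1.8.0 contrib; the `build_spark_submit_cmd` and `run_spark_job` helpers and their parameters are hypothetical names chosen for this sketch.

```python
# Sketch: how an operator might shell out to spark-submit.
# Hypothetical helpers; the real Airflow operator has a richer interface.
import subprocess


def build_spark_submit_cmd(application, master="yarn", deploy_mode="cluster",
                           conf=None, app_args=None):
    """Assemble a spark-submit command line as a list of arguments."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Each Spark config key/value becomes a --conf key=value pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(application)          # the .py file or .jar to submit
    cmd += list(app_args or [])      # arguments passed through to the job
    return cmd


def run_spark_job(application, **kwargs):
    """Launch the job; a non-zero exit raises CalledProcessError,
    which is how an Airflow task would end up marked failed."""
    cmd = build_spark_submit_cmd(application, **kwargs)
    subprocess.run(cmd, check=True)
```

Note that each invocation spawns a fresh JVM and SparkContext, which is exactly the per-call startup cost (30+ seconds on some clusters) raised earlier in the thread; keeping a context alive between tasks would require a long-running service rather than a per-task spark-submit.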
