Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-27 Thread Henry Robinson
On Mon, 27 Aug 2018 at 13:04, Ankur Gupta 
wrote:

> Thanks all for your responses.
>
> So I believe a solution that accomplishes the following will be a good
> solution:
>
> 1. Writes logs to HDFS asynchronously
>

In the limit, this could perform just as slowly at shutdown time as
synchronous logging (imagine a job produces a huge amount of log output and
immediately completes). Will you plan to wait for the logging to complete,
wait up to some maximum time, or just exit quickly no matter how much log
shipping has been done?
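
As a concrete illustration of the "wait up to some maximum time" option, here is a minimal sketch in plain Python (not Spark code; the class name, the `ship` callback, and the 30-second cap are all illustrative assumptions):

    # Illustrative sketch only -- not Spark code. An async shipper whose
    # shutdown waits up to max_wait_secs for pending lines, then gives up.
    import queue
    import threading

    class AsyncLogShipper:
        def __init__(self, ship, max_wait_secs=30.0):
            self._ship = ship                  # callable that uploads one batch of lines
            self._max_wait_secs = max_wait_secs
            self._queue = queue.Queue()
            self._stopping = threading.Event()
            self._worker = threading.Thread(target=self._drain, daemon=True)
            self._worker.start()

        def write(self, line):
            self._queue.put(line)              # never blocks the caller

        def _drain(self):
            while not (self._stopping.is_set() and self._queue.empty()):
                try:
                    batch = [self._queue.get(timeout=0.5)]
                except queue.Empty:
                    continue
                while not self._queue.empty():
                    batch.append(self._queue.get_nowait())
                self._ship(batch)              # upload happens off the main thread

        def close(self):
            # Bounded wait at shutdown: ship what we can, then exit anyway.
            self._stopping.set()
            self._worker.join(timeout=self._max_wait_secs)

    # e.g. shipper = AsyncLogShipper(ship=upload_batch_to_hdfs)  # hypothetical uploader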


> 2. Writes logs at INFO level while ensuring that console logs are written
> at WARN level by default (in shell mode)
> 3. Optionally, moves this file to Yarn's Remote Application Dir (to ensure
> that shutdown operation does not slow down significantly)
>
> If this resolves all the concerns, then I can work on a PR to add this
> functionality.
>
> On Fri, Aug 24, 2018 at 3:12 PM Marcelo Vanzin 
> wrote:
>
>> I think this would be useful, but I also share Saisai's and Marco's
>> concern about the extra step when shutting down the application. If
>> that could be minimized this would be a much more interesting feature.
>>
>> e.g. you could upload logs incrementally to HDFS, asynchronously,
>> while the app is running. Or you could pipe them to the YARN AM over
>> Spark's RPC (losing some logs in the beginning and end of the driver
>> execution). Or maybe something else.
>>
>> There is also the issue of shell logs being at "warn" level by
>> default, so even if you write these to a file, they're not really that
>> useful for debugging. So a solution that keeps that behavior, but
>> writes INFO logs to this new sink, would be great.
>>
>> If you can come up with a solution to those problems I think this
>> could be a good feature.
>>
>>
>> On Wed, Aug 22, 2018 at 10:01 AM, Ankur Gupta
>>  wrote:
>> > Thanks for your responses Saisai and Marco.
>> >
>> > I agree that the "rename" operation can be time-consuming on object storage,
>> > which can potentially delay the shutdown.
>> >
>> > I also agree that customers/users have a way to use log appenders to
>> write
>> > log files and then send them along with Yarn application logs but I
>> still
>> > think it is a cumbersome process. Also, there is the issue that
>> customers
>> > cannot easily identify which logs belong to which application, without
>> > reading the log file. And if users run multiple applications with
>> default
>> > log4j configurations on the same host, then they can end up writing to
>> the
>> > same log file.
>> >
>> > Because of the issues mentioned above, we can maybe think of this as an
>> > optional feature, which will be disabled by default but turned on by
>> > customers. This will solve the problems mentioned above, reduce the
>> overhead
>> > on users/customers while adding a bit of overhead during the shutdown
>> phase
>> > of the Spark application.
>> >
>> > Thanks,
>> > Ankur
>> >
>> > On Wed, Aug 22, 2018 at 1:36 AM Marco Gaido 
>> wrote:
>> >>
>> >> I agree with Saisai. You can also configure log4j to append anywhere
>> else
>> >> other than the console. Many companies have their system for
>> collecting and
>> >> monitoring logs and they just customize the log4j configuration. I am
>> not
>> >> sure how needed this change would be.
>> >>
>> >> Thanks,
>> >> Marco
>> >>
>> >> Il giorno mer 22 ago 2018 alle ore 04:31 Saisai Shao
>> >>  ha scritto:
>> >>>
>> >>> One issue I can think of is that moving the driver log at the end of
>> >>> the application is quite time-consuming, which will significantly
>> >>> delay the shutdown. We have already suffered from this "rename"
>> >>> problem for event logs on object stores; moving the driver log will
>> >>> make the problem worse.
>> >>>
>> >>> For a vanilla Spark-on-YARN client application, I think the user could
>> >>> redirect the console output to a log file and provide both the driver
>> >>> log and the YARN application log to the customers; this does not seem
>> >>> like a big overhead.
>> >>>
>> >>> Just my two cents.
>> >>>
>> >>> Thanks
>> >>> Saisai
>> >>>
>> >>> Ankur Gupta  于2018年8月22日周三
>> 上午5:19写道:
>> 
>>  Hi all,
>> 
>>  I want to highlight a problem that we face here at Cloudera and start a
>>  discussion on how to go about solving it.
>> 
>>  Problem Statement:
>>  Our customers reach out to us when they face problems in their Spark
>>  applications. Those problems can be related to Spark, environment issues,
>>  their own code, or something else altogether. A lot of the time these
>>  customers run their Spark applications in YARN client mode, which, as we
>>  all know, uses a ConsoleAppender to print logs to the console. These
>>  customers usually send their YARN logs to us to troubleshoot. As you may
>>  have figured, these logs do not contain the driver logs, which makes it
>>  difficult for us to troubleshoot the issue. In that scenario our customers
>>  end up running the application again, piping the output to a log file or
>>  using a local log appender and then sending over that file.

Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-27 Thread Ankur Gupta
Thanks all for your responses.

So I believe a solution that accomplishes the following will be a good
solution:

1. Writes logs to HDFS asynchronously
2. Writes logs at INFO level while ensuring that console logs are written
at WARN level by default (in shell mode)
3. Optionally, moves this file to Yarn's Remote Application Dir (to ensure
that shutdown operation does not slow down significantly)

If this resolves all the concerns, then I can work on a PR to add this
functionality.
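
The actual change would live in Spark's log4j configuration rather than in Python, but the per-sink threshold behind points 1-2 can be sketched with Python's standard logging module as an analogy (the logger name and file path are placeholders): the console handler only passes WARN and above, while the file handler keeps the full INFO stream.

    # Analogy only (the real change is in log4j): one logger, two sinks with
    # different thresholds -- quiet console, verbose file.
    import logging

    logger = logging.getLogger("driver")                  # placeholder name
    logger.setLevel(logging.INFO)                         # capture INFO and above

    console = logging.StreamHandler()
    console.setLevel(logging.WARNING)                     # console stays at WARN

    logfile = logging.FileHandler("/tmp/driver.log")      # placeholder path
    logfile.setLevel(logging.INFO)                        # full INFO stream on disk

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    console.setFormatter(fmt)
    logfile.setFormatter(fmt)
    logger.addHandler(console)
    logger.addHandler(logfile)

    logger.info("written to the file only")
    logger.warning("written to both the file and the console")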

On Fri, Aug 24, 2018 at 3:12 PM Marcelo Vanzin 
wrote:

> I think this would be useful, but I also share Saisai's and Marco's
> concern about the extra step when shutting down the application. If
> that could be minimized this would be a much more interesting feature.
>
> e.g. you could upload logs incrementally to HDFS, asynchronously,
> while the app is running. Or you could pipe them to the YARN AM over
> Spark's RPC (losing some logs in the beginning and end of the driver
> execution). Or maybe something else.
>
> There is also the issue of shell logs being at "warn" level by
> default, so even if you write these to a file, they're not really that
> useful for debugging. So a solution that keeps that behavior, but
> writes INFO logs to this new sink, would be great.
>
> If you can come up with a solution to those problems I think this
> could be a good feature.
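
One low-tech way to prototype the "upload logs incrementally to HDFS, asynchronously" idea above, assuming the hdfs CLI is available and the destination filesystem supports append (true for HDFS, generally not for object stores). The paths and interval are placeholders; this is only a sketch of the approach, not the proposed implementation.

    # Sketch: periodically append new local driver-log bytes to a file on HDFS.
    import subprocess
    import time

    LOCAL_LOG = "/tmp/driver.log"                      # placeholder
    HDFS_DEST = "/user/example/driver-logs/app.log"    # placeholder
    INTERVAL_SECS = 30                                 # placeholder

    def ship_incrementally():
        offset = 0
        while True:
            with open(LOCAL_LOG, "rb") as f:
                f.seek(offset)
                chunk = f.read()
            if chunk:
                # "hdfs dfs -appendToFile -" reads from stdin and appends to the target.
                subprocess.run(["hdfs", "dfs", "-appendToFile", "-", HDFS_DEST],
                               input=chunk, check=True)
                offset += len(chunk)
            time.sleep(INTERVAL_SECS)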
>
>
> On Wed, Aug 22, 2018 at 10:01 AM, Ankur Gupta
>  wrote:
> > Thanks for your responses Saisai and Marco.
> >
> > I agree that the "rename" operation can be time-consuming on object storage,
> > which can potentially delay the shutdown.
> >
> > I also agree that customers/users have a way to use log appenders to
> write
> > log files and then send them along with Yarn application logs but I still
> > think it is a cumbersome process. Also, there is the issue that customers
> > cannot easily identify which logs belong to which application, without
> > reading the log file. And if users run multiple applications with default
> > log4j configurations on the same host, then they can end up writing to
> the
> > same log file.
> >
> > Because of the issues mentioned above, we can maybe think of this as an
> > optional feature, which will be disabled by default but turned on by
> > customers. This will solve the problems mentioned above, reduce the
> overhead
> > on users/customers while adding a bit of overhead during the shutdown
> phase
> > of the Spark application.
> >
> > Thanks,
> > Ankur
> >
> > On Wed, Aug 22, 2018 at 1:36 AM Marco Gaido 
> wrote:
> >>
> >> I agree with Saisai. You can also configure log4j to append anywhere
> else
> >> other than the console. Many companies have their system for collecting
> and
> >> monitoring logs and they just customize the log4j configuration. I am
> not
> >> sure how needed this change would be.
> >>
> >> Thanks,
> >> Marco
> >>
> >> Il giorno mer 22 ago 2018 alle ore 04:31 Saisai Shao
> >>  ha scritto:
> >>>
> >>> One issue I can think of is that moving the driver log at the end of
> >>> the application is quite time-consuming, which will significantly
> >>> delay the shutdown. We have already suffered from this "rename"
> >>> problem for event logs on object stores; moving the driver log will
> >>> make the problem worse.
> >>>
> >>> For a vanilla Spark-on-YARN client application, I think the user could
> >>> redirect the console output to a log file and provide both the driver
> >>> log and the YARN application log to the customers; this does not seem
> >>> like a big overhead.
> >>>
> >>> Just my two cents.
> >>>
> >>> Thanks
> >>> Saisai
> >>>
> >>> Ankur Gupta  于2018年8月22日周三 上午5:19写道:
> 
>  Hi all,
> 
>  I want to highlight a problem that we face here at Cloudera and start a
>  discussion on how to go about solving it.
> 
>  Problem Statement:
>  Our customers reach out to us when they face problems in their Spark
>  applications. Those problems can be related to Spark, environment issues,
>  their own code, or something else altogether. A lot of the time these
>  customers run their Spark applications in YARN client mode, which, as we
>  all know, uses a ConsoleAppender to print logs to the console. These
>  customers usually send their YARN logs to us to troubleshoot. As you may
>  have figured, these logs do not contain the driver logs, which makes it
>  difficult for us to troubleshoot the issue. In that scenario our customers
>  end up running the application again, piping the output to a log file or
>  using a local log appender and then sending over that file.
> 
>  I believe that there are other users in the community who face a similar
>  problem, where the central team managing Spark clusters has difficulty
>  helping the end users because they ran their application in shell or YARN
>  client mode (I am not sure what the equivalent is in Mesos).
> 
>  Additionally, there may be teams who want to capture all these logs so
>  that they can 

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
Hi,

I'm a long-time listener, first-time contributor to Spark, so this is a
good way to get my feet wet. I'm particularly interested in SPARK-23836,
which is an itch I may want to dive into and scratch myself in the
next month or so, since it's pretty painful for our use case.

Thanks!
Andrew

On Mon, Aug 27, 2018 at 2:20 PM, Holden Karau  wrote:
> Sure, I don't think you should wait on that being merged in. If you want to
> take the JIRA go ahead (although if you're already familiar with the Spark
> code base it might make sense to leave it as a starter issue for someone who
> is just getting started).
>
> On Mon, Aug 27, 2018 at 12:18 PM Andrew Melo  wrote:
>>
>> Hi Holden,
>>
>> I'm agnostic to the approach (though it seems cleaner to have an
>> explicit API for it). If you would like, I can take that JIRA and
>> implement it (should be a 3-line function).
>>
>> Cheers
>> Andrew
>>
>> On Mon, Aug 27, 2018 at 2:14 PM, Holden Karau 
>> wrote:
>> > Seems reasonable. We should probably add `getActiveSession` to the
>> > PySpark
>> > API (filed a starter JIRA
>> > https://issues.apache.org/jira/browse/SPARK-25255
>> > )
>> >
>> > On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo 
>> > wrote:
>> >>
>> >> Hello Sean, others -
>> >>
>> >> Just to confirm, is it OK for client applications to access
>> >> SparkContext._active_spark_context, if it wraps the accesses in `with
>> >> SparkContext._lock:`?
>> >>
>> >> If that's acceptable to Spark, I'll implement the modifications in the
>> >> Jupyter extensions.
>> >>
>> >> thanks!
>> >> Andrew
>> >>
>> >> On Tue, Aug 7, 2018 at 5:52 PM, Andrew Melo 
>> >> wrote:
>> >> > Hi Sean,
>> >> >
>> >> > On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen  wrote:
>> >> >> Ah, python.  How about SparkContext._active_spark_context then?
>> >> >
>> >> > Ah yes, that looks like the right member, but I'm a bit wary about
>> >> > depending on functionality of objects with leading underscores. I
>> >> > assumed that was "private" and subject to change. Is that something I
>> >> > should be unconcerned about.
>> >> >
>> >> > The other thought is that the accesses with SparkContext are
>> >> > protected
>> >> > by "SparkContext._lock" -- should I also use that lock?
>> >> >
>> >> > Thanks for your help!
>> >> > Andrew
>> >> >
>> >> >>
>> >> >> On Tue, Aug 7, 2018 at 5:34 PM Andrew Melo 
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi Sean,
>> >> >>>
>> >> >>> On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
>> >> >>> > Is SparkSession.getActiveSession what you're looking for?
>> >> >>>
>> >> >>> Perhaps -- though there's not a corresponding python function, and
>> >> >>> I'm
>> >> >>> not exactly sure how to call the scala getActiveSession without
>> >> >>> first
>> >> >>> instantiating the python version and causing a JVM to start.
>> >> >>>
>> >> >>> Is there an easy way to call getActiveSession that doesn't start a
>> >> >>> JVM?
>> >> >>>
>> >> >>> Cheers
>> >> >>> Andrew
>> >> >>>
>> >> >>> >
>> >> >>> > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo
>> >> >>> > 
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Hello,
>> >> >>> >>
>> >> >>> >> One pain point with various Jupyter extensions [1][2] that
>> >> >>> >> provide
>> >> >>> >> visual feedback about running spark processes is the lack of a
>> >> >>> >> public
>> >> >>> >> API to introspect the web URL. The notebook server needs to know
>> >> >>> >> the
>> >> >>> >> URL to find information about the current SparkContext.
>> >> >>> >>
>> >> >>> >> Simply looking for "localhost:4040" works most of the time, but
>> >> >>> >> fails
>> >> >>> >> if multiple spark notebooks are being run on the same host --
>> >> >>> >> spark
>> >> >>> >> increments the port for each new context, leading to confusion
>> >> >>> >> when
>> >> >>> >> the notebooks are trying to probe the web interface for
>> >> >>> >> information.
>> >> >>> >>
>> >> >>> >> I'd like to implement an analog to SparkContext.getOrCreate(),
>> >> >>> >> perhaps
>> >> >>> >> called "getIfExists()" that returns the current singleton if it
>> >> >>> >> exists, or None otherwise. The Jupyter code would then be able
>> >> >>> >> to
>> >> >>> >> use
>> >> >>> >> this entrypoint to query Spark for an active Spark context,
>> >> >>> >> which
>> >> >>> >> it
>> >> >>> >> could use to probe the web URL.
>> >> >>> >>
>> >> >>> >> It's a minor change, but this would be my first contribution to
>> >> >>> >> Spark,
>> >> >>> >> and I want to make sure my plan was kosher before I implemented
>> >> >>> >> it.
>> >> >>> >>
>> >> >>> >> Thanks!
>> >> >>> >> Andrew
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> [1] https://krishnan-r.github.io/sparkmonitor/
>> >> >>> >>
>> >> >>> >> [2] https://github.com/mozilla/jupyter-spark
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> -
>> >> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >> >>> >>
>> >> >>> >
>> >>
>> >> 

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Holden Karau
Sure, I don't think you should wait on that being merged in. If you want to
take the JIRA go ahead (although if you're already familiar with the Spark
code base it might make sense to leave it as a starter issue for someone
who is just getting started).

On Mon, Aug 27, 2018 at 12:18 PM Andrew Melo  wrote:

> Hi Holden,
>
> I'm agnostic to the approach (though it seems cleaner to have an
> explicit API for it). If you would like, I can take that JIRA and
> implement it (should be a 3-line function).
>
> Cheers
> Andrew
>
> On Mon, Aug 27, 2018 at 2:14 PM, Holden Karau 
> wrote:
> > Seems reasonable. We should probably add `getActiveSession` to the
> PySpark
> > API (filed a starter JIRA
> https://issues.apache.org/jira/browse/SPARK-25255
> > )
> >
> > On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo 
> wrote:
> >>
> >> Hello Sean, others -
> >>
> >> Just to confirm, is it OK for client applications to access
> >> SparkContext._active_spark_context, if it wraps the accesses in `with
> >> SparkContext._lock:`?
> >>
> >> If that's acceptable to Spark, I'll implement the modifications in the
> >> Jupyter extensions.
> >>
> >> thanks!
> >> Andrew
> >>
> >> On Tue, Aug 7, 2018 at 5:52 PM, Andrew Melo 
> wrote:
> >> > Hi Sean,
> >> >
> >> > On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen  wrote:
> >> >> Ah, python.  How about SparkContext._active_spark_context then?
> >> >
> >> > Ah yes, that looks like the right member, but I'm a bit wary about
> >> > depending on functionality of objects with leading underscores. I
> >> > assumed that was "private" and subject to change. Is that something I
> >> > should be unconcerned about.
> >> >
> >> > The other thought is that the accesses with SparkContext are protected
> >> > by "SparkContext._lock" -- should I also use that lock?
> >> >
> >> > Thanks for your help!
> >> > Andrew
> >> >
> >> >>
> >> >> On Tue, Aug 7, 2018 at 5:34 PM Andrew Melo 
> >> >> wrote:
> >> >>>
> >> >>> Hi Sean,
> >> >>>
> >> >>> On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
> >> >>> > Is SparkSession.getActiveSession what you're looking for?
> >> >>>
> >> >>> Perhaps -- though there's not a corresponding python function, and
> I'm
> >> >>> not exactly sure how to call the scala getActiveSession without
> first
> >> >>> instantiating the python version and causing a JVM to start.
> >> >>>
> >> >>> Is there an easy way to call getActiveSession that doesn't start a
> >> >>> JVM?
> >> >>>
> >> >>> Cheers
> >> >>> Andrew
> >> >>>
> >> >>> >
> >> >>> > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo  >
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Hello,
> >> >>> >>
> >> >>> >> One pain point with various Jupyter extensions [1][2] that
> provide
> >> >>> >> visual feedback about running spark processes is the lack of a
> >> >>> >> public
> >> >>> >> API to introspect the web URL. The notebook server needs to know
> >> >>> >> the
> >> >>> >> URL to find information about the current SparkContext.
> >> >>> >>
> >> >>> >> Simply looking for "localhost:4040" works most of the time, but
> >> >>> >> fails
> >> >>> >> if multiple spark notebooks are being run on the same host --
> spark
> >> >>> >> increments the port for each new context, leading to confusion
> when
> >> >>> >> the notebooks are trying to probe the web interface for
> >> >>> >> information.
> >> >>> >>
> >> >>> >> I'd like to implement an analog to SparkContext.getOrCreate(),
> >> >>> >> perhaps
> >> >>> >> called "getIfExists()" that returns the current singleton if it
> >> >>> >> exists, or None otherwise. The Jupyter code would then be able to
> >> >>> >> use
> >> >>> >> this entrypoint to query Spark for an active Spark context, which
> >> >>> >> it
> >> >>> >> could use to probe the web URL.
> >> >>> >>
> >> >>> >> It's a minor change, but this would be my first contribution to
> >> >>> >> Spark,
> >> >>> >> and I want to make sure my plan was kosher before I implemented
> it.
> >> >>> >>
> >> >>> >> Thanks!
> >> >>> >> Andrew
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> [1] https://krishnan-r.github.io/sparkmonitor/
> >> >>> >>
> >> >>> >> [2] https://github.com/mozilla/jupyter-spark
> >> >>> >>
> >> >>> >>
> >> >>> >>
> -
> >> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >> >>> >>
> >> >>> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> > https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
Hi Holden,

I'm agnostic to the approach (though it seems cleaner to have an
explicit API for it). If you would like, I can take that JIRA and
implement it (should be a 3-line function).

Cheers
Andrew

On Mon, Aug 27, 2018 at 2:14 PM, Holden Karau  wrote:
> Seems reasonable. We should probably add `getActiveSession` to the PySpark
> API (filed a starter JIRA https://issues.apache.org/jira/browse/SPARK-25255
> )
>
> On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo  wrote:
>>
>> Hello Sean, others -
>>
>> Just to confirm, is it OK for client applications to access
>> SparkContext._active_spark_context, if it wraps the accesses in `with
>> SparkContext._lock:`?
>>
>> If that's acceptable to Spark, I'll implement the modifications in the
>> Jupyter extensions.
>>
>> thanks!
>> Andrew
>>
>> On Tue, Aug 7, 2018 at 5:52 PM, Andrew Melo  wrote:
>> > Hi Sean,
>> >
>> > On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen  wrote:
>> >> Ah, python.  How about SparkContext._active_spark_context then?
>> >
>> > Ah yes, that looks like the right member, but I'm a bit wary about
>> > depending on functionality of objects with leading underscores. I
>> > assumed that was "private" and subject to change. Is that something I
>> > should be unconcerned about.
>> >
>> > The other thought is that the accesses with SparkContext are protected
>> > by "SparkContext._lock" -- should I also use that lock?
>> >
>> > Thanks for your help!
>> > Andrew
>> >
>> >>
>> >> On Tue, Aug 7, 2018 at 5:34 PM Andrew Melo 
>> >> wrote:
>> >>>
>> >>> Hi Sean,
>> >>>
>> >>> On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
>> >>> > Is SparkSession.getActiveSession what you're looking for?
>> >>>
>> >>> Perhaps -- though there's not a corresponding python function, and I'm
>> >>> not exactly sure how to call the scala getActiveSession without first
>> >>> instantiating the python version and causing a JVM to start.
>> >>>
>> >>> Is there an easy way to call getActiveSession that doesn't start a
>> >>> JVM?
>> >>>
>> >>> Cheers
>> >>> Andrew
>> >>>
>> >>> >
>> >>> > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo 
>> >>> > wrote:
>> >>> >>
>> >>> >> Hello,
>> >>> >>
>> >>> >> One pain point with various Jupyter extensions [1][2] that provide
>> >>> >> visual feedback about running spark processes is the lack of a
>> >>> >> public
>> >>> >> API to introspect the web URL. The notebook server needs to know
>> >>> >> the
>> >>> >> URL to find information about the current SparkContext.
>> >>> >>
>> >>> >> Simply looking for "localhost:4040" works most of the time, but
>> >>> >> fails
>> >>> >> if multiple spark notebooks are being run on the same host -- spark
>> >>> >> increments the port for each new context, leading to confusion when
>> >>> >> the notebooks are trying to probe the web interface for
>> >>> >> information.
>> >>> >>
>> >>> >> I'd like to implement an analog to SparkContext.getOrCreate(),
>> >>> >> perhaps
>> >>> >> called "getIfExists()" that returns the current singleton if it
>> >>> >> exists, or None otherwise. The Jupyter code would then be able to
>> >>> >> use
>> >>> >> this entrypoint to query Spark for an active Spark context, which
>> >>> >> it
>> >>> >> could use to probe the web URL.
>> >>> >>
>> >>> >> It's a minor change, but this would be my first contribution to
>> >>> >> Spark,
>> >>> >> and I want to make sure my plan was kosher before I implemented it.
>> >>> >>
>> >>> >> Thanks!
>> >>> >> Andrew
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> [1] https://krishnan-r.github.io/sparkmonitor/
>> >>> >>
>> >>> >> [2] https://github.com/mozilla/jupyter-spark
>> >>> >>
>> >>> >>
>> >>> >> -
>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>> >>
>> >>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SparkContext singleton get w/o create?

2018-08-27 Thread Holden Karau
Seems reasonable. We should probably add `getActiveSession` to the PySpark
API (filed a starter JIRA https://issues.apache.org/jira/browse/SPARK-25255
)

On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo  wrote:

> Hello Sean, others -
>
> Just to confirm, is it OK for client applications to access
> SparkContext._active_spark_context, if it wraps the accesses in `with
> SparkContext._lock:`?
>
> If that's acceptable to Spark, I'll implement the modifications in the
> Jupyter extensions.
>
> thanks!
> Andrew
>
> On Tue, Aug 7, 2018 at 5:52 PM, Andrew Melo  wrote:
> > Hi Sean,
> >
> > On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen  wrote:
> >> Ah, python.  How about SparkContext._active_spark_context then?
> >
> > Ah yes, that looks like the right member, but I'm a bit wary about
> > depending on functionality of objects with leading underscores. I
> > assumed that was "private" and subject to change. Is that something I
> > should be unconcerned about.
> >
> > The other thought is that the accesses with SparkContext are protected
> > by "SparkContext._lock" -- should I also use that lock?
> >
> > Thanks for your help!
> > Andrew
> >
> >>
> >> On Tue, Aug 7, 2018 at 5:34 PM Andrew Melo 
> wrote:
> >>>
> >>> Hi Sean,
> >>>
> >>> On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
> >>> > Is SparkSession.getActiveSession what you're looking for?
> >>>
> >>> Perhaps -- though there's not a corresponding python function, and I'm
> >>> not exactly sure how to call the scala getActiveSession without first
> >>> instantiating the python version and causing a JVM to start.
> >>>
> >>> Is there an easy way to call getActiveSession that doesn't start a JVM?
> >>>
> >>> Cheers
> >>> Andrew
> >>>
> >>> >
> >>> > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo 
> >>> > wrote:
> >>> >>
> >>> >> Hello,
> >>> >>
> >>> >> One pain point with various Jupyter extensions [1][2] that provide
> >>> >> visual feedback about running spark processes is the lack of a
> public
> >>> >> API to introspect the web URL. The notebook server needs to know the
> >>> >> URL to find information about the current SparkContext.
> >>> >>
> >>> >> Simply looking for "localhost:4040" works most of the time, but
> fails
> >>> >> if multiple spark notebooks are being run on the same host -- spark
> >>> >> increments the port for each new context, leading to confusion when
> >>> >> the notebooks are trying to probe the web interface for information.
> >>> >>
> >>> >> I'd like to implement an analog to SparkContext.getOrCreate(),
> perhaps
> >>> >> called "getIfExists()" that returns the current singleton if it
> >>> >> exists, or None otherwise. The Jupyter code would then be able to
> use
> >>> >> this entrypoint to query Spark for an active Spark context, which it
> >>> >> could use to probe the web URL.
> >>> >>
> >>> >> It's a minor change, but this would be my first contribution to
> Spark,
> >>> >> and I want to make sure my plan was kosher before I implemented it.
> >>> >>
> >>> >> Thanks!
> >>> >> Andrew
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> [1] https://krishnan-r.github.io/sparkmonitor/
> >>> >>
> >>> >> [2] https://github.com/mozilla/jupyter-spark
> >>> >>
> >>> >>
> -
> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>
> >>> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
Hello Sean, others -

Just to confirm, is it OK for client applications to access
SparkContext._active_spark_context, if it wraps the accesses in `with
SparkContext._lock:`?
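
For reference, the access pattern in question is only a few lines of PySpark; it leans on the private _lock and _active_spark_context attributes, so it is a stopgap that may break across versions rather than a supported API (uiWebUrl is the property the notebook extensions want to probe):

    # Stopgap sketch: peek at the active SparkContext without creating one.
    # Relies on private PySpark attributes, which may change between releases.
    from pyspark import SparkContext

    def get_existing_spark_context():
        """Return the active SparkContext if one exists, otherwise None."""
        with SparkContext._lock:
            return SparkContext._active_spark_context

    sc = get_existing_spark_context()
    if sc is not None:
        print(sc.uiWebUrl)   # web UI URL for the notebook extension to probe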

If that's acceptable to Spark, I'll implement the modifications in the
Jupyter extensions.

thanks!
Andrew

On Tue, Aug 7, 2018 at 5:52 PM, Andrew Melo  wrote:
> Hi Sean,
>
> On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen  wrote:
>> Ah, python.  How about SparkContext._active_spark_context then?
>
> Ah yes, that looks like the right member, but I'm a bit wary about
> depending on functionality of objects with leading underscores. I
> assumed that was "private" and subject to change. Is that something I
> should be unconcerned about.
>
> The other thought is that the accesses with SparkContext are protected
> by "SparkContext._lock" -- should I also use that lock?
>
> Thanks for your help!
> Andrew
>
>>
>> On Tue, Aug 7, 2018 at 5:34 PM Andrew Melo  wrote:
>>>
>>> Hi Sean,
>>>
>>> On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
>>> > Is SparkSession.getActiveSession what you're looking for?
>>>
>>> Perhaps -- though there's not a corresponding python function, and I'm
>>> not exactly sure how to call the scala getActiveSession without first
>>> instantiating the python version and causing a JVM to start.
>>>
>>> Is there an easy way to call getActiveSession that doesn't start a JVM?
>>>
>>> Cheers
>>> Andrew
>>>
>>> >
>>> > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo 
>>> > wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> One pain point with various Jupyter extensions [1][2] that provide
>>> >> visual feedback about running spark processes is the lack of a public
>>> >> API to introspect the web URL. The notebook server needs to know the
>>> >> URL to find information about the current SparkContext.
>>> >>
>>> >> Simply looking for "localhost:4040" works most of the time, but fails
>>> >> if multiple spark notebooks are being run on the same host -- spark
>>> >> increments the port for each new context, leading to confusion when
>>> >> the notebooks are trying to probe the web interface for information.
>>> >>
>>> >> I'd like to implement an analog to SparkContext.getOrCreate(), perhaps
>>> >> called "getIfExists()" that returns the current singleton if it
>>> >> exists, or None otherwise. The Jupyter code would then be able to use
>>> >> this entrypoint to query Spark for an active Spark context, which it
>>> >> could use to probe the web URL.
>>> >>
>>> >> It's a minor change, but this would be my first contribution to Spark,
>>> >> and I want to make sure my plan was kosher before I implemented it.
>>> >>
>>> >> Thanks!
>>> >> Andrew
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> [1] https://krishnan-r.github.io/sparkmonitor/
>>> >>
>>> >> [2] https://github.com/mozilla/jupyter-spark
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>> >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: no logging in pyspark code?

2018-08-27 Thread Imran Rashid
ah, great, thanks!  sorry I missed that, I'll watch that jira.

On Mon, Aug 27, 2018 at 12:41 PM Ilan Filonenko  wrote:

> A JIRA was opened on this exact topic a few days ago: SPARK-25236, after
> seeing another case of print(_, file=sys.stderr) in a recent review. I
> agree that we should include logging for PySpark workers.
>
> On Mon, Aug 27, 2018 at 1:29 PM, Imran Rashid <
> iras...@cloudera.com.invalid> wrote:
>
>> Another question on pyspark code -- how come there is no logging at all?
>> does python logging have an unreasonable overhead, or is it impossible to
>> configure or something?
>>
>> I'm really surprised nobody has ever wanted to be able to turn on some
>> debug or trace logging in pyspark by just configuring a logging level.
>>
>> For me, I wanted this during debugging while developing -- I'd work on
>> some part of the code and drop in a bunch of print statements.  Then I'd
>> rip those out when I think I'm ready to submit a patch.  But then I realize
>> I forgot some case, then more debugging -- oh gotta add those print
>> statements in again ...
>>
>> does somebody just need to set up the configuration properly, or is there
>> a bigger reason to avoid logging in python?
>>
>> thanks,
>> Imran
>>
>
>


Re: no logging in pyspark code?

2018-08-27 Thread Ilan Filonenko
A JIRA was opened on this exact topic a few days ago: SPARK-25236, after
seeing another case of print(_, file=sys.stderr) in a recent review. I
agree that we should include logging for PySpark workers.

On Mon, Aug 27, 2018 at 1:29 PM, Imran Rashid 
wrote:

> Another question on pyspark code -- how come there is no logging at all?
> does python logging have an unreasonable overhead, or is it impossible to
> configure or something?
>
> I'm really surprised nobody has ever wanted to be able to turn on some
> debug or trace logging in pyspark by just configuring a logging level.
>
> For me, I wanted this during debugging while developing -- I'd work on
> some part of the code and drop in a bunch of print statements.  Then I'd
> rip those out when I think I'm ready to submit a patch.  But then I realize
> I forgot some case, then more debugging -- oh gotta add those print
> statements in again ...
>
> does somebody just need to set up the configuration properly, or is there a
> bigger reason to avoid logging in python?
>
> thanks,
> Imran
>


no logging in pyspark code?

2018-08-27 Thread Imran Rashid
Another question on pyspark code -- how come there is no logging at all?
does python logging have an unreasonable overhead, or is it impossible to
configure or something?

I'm really surprised nobody has ever wanted to be able to turn on some
debug or trace logging in pyspark by just configuring a logging level.

For me, I wanted this during debugging while developing -- I'd work on some
part of the code and drop in a bunch of print statements.  Then I'd rip
those out when I think I'm ready to submit a patch.  But then I realize I
forgot some case, then more debugging -- oh gotta add those print
statements in again ...

does somebody just need to set up the configuration properly, or is there a
bigger reason to avoid logging in python?

thanks,
Imran
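
For reference, the level-switch workflow Imran describes is ordinary standard-library logging; the sketch below is illustrative (the module name, environment variable, and messages are made up), and the open question for Spark is wiring a sensible default configuration rather than the mechanism itself:

    # Ordinary stdlib logging: debug output is toggled by a level instead of
    # adding and removing print statements. Names and messages are illustrative.
    import logging
    import os

    logger = logging.getLogger("pyspark.worker.example")   # hypothetical name

    def process_partition(rows):
        logger.debug("processing %d rows", len(rows))       # stays in the code
        return [r * 2 for r in rows]

    if __name__ == "__main__":
        # e.g. run with PYSPARK_LOG_LEVEL=DEBUG (a made-up variable) for debug output
        level = os.environ.get("PYSPARK_LOG_LEVEL", "WARNING")
        logging.basicConfig(level=getattr(logging, level.upper(), logging.WARNING),
                            format="%(asctime)s %(levelname)s %(name)s: %(message)s")
        print(process_partition([1, 2, 3]))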


Why is View logical operator not a UnaryNode explicitly?

2018-08-27 Thread Jacek Laskowski
Hi,

I've just come across the View logical operator, which is not explicitly a
UnaryNode, i.e. it does not "extend UnaryNode". Why?

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala?utf8=%E2%9C%93#L460-L463

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski