Re: Python friendly API for Spark 3.0

2018-09-14 Thread Nicholas Chammas
Do we need to ditch Python 2 support to provide type hints? I don’t think
so.

Python lets you specify typing stubs that provide the same benefit without
forcing Python 3.
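As a rough sketch of what that could look like (a hypothetical stub file such
as pyspark/sql/dataframe.pyi; the signatures below are illustrative only, not
a proposal for the actual stubs):

    # Hypothetical stub file, e.g. pyspark/sql/dataframe.pyi -- illustration only.
    # Stub files are never executed, so Python-3 annotation syntax works here
    # even while the real pyspark sources stay Python-2 compatible.
    from typing import List

    class DataFrame:
        @property
        def columns(self) -> List[str]: ...
        def count(self) -> int: ...
        def limit(self, num: int) -> "DataFrame": ...

Type checkers and IDEs pick stubs up alongside the installed package, so users
still on Python 2 see no behavior change at all.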

On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau  wrote:

>
>
> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>
>> To be clear, is this about "python-friendly API" or "friendly python API" ?
>>
> Well what would you consider to be different between those two statements?
> I think it would be good to be a bit more explicit, but I don't think we
> should necessarily limit ourselves.
>
>>
>> On the python side, it might be nice to take advantage of static typing.
>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>> good opportunity to jump the python-3-only train.
>>
> I think we can make types sort of work without ditching 2 (the type hints
> would only work in 3, but the code would still function in 2). Ditching 2
> entirely would be a big thing to consider; I honestly hadn't been
> considering that, but that may just be from spending so much time
> maintaining a 2/3 code base. I'd suggest reaching out to user@ before
> making that kind of change.
>
>>
>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>> wrote:
>>
>>> Since we're talking about Spark 3.0 in the near future (and since some
>>> recent conversation on a proposed change reminded me) I wanted to open up
>>> the floor and see if folks have any ideas on how we could make a more
>>> Python friendly API for 3.0? I'm planning on taking some time to look at
>>> other systems in the solution space and see what we might want to learn
>>> from them but I'd love to hear what other folks are thinking too.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>


Re: Python friendly API for Spark 3.0

2018-09-14 Thread Holden Karau
On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:

> To be clear, is this about "python-friendly API" or "friendly python API" ?
>
Well what would you consider to be different between those two statements?
I think it would be good to be a bit more explicit, but I don't think we
should necessarily limit ourselves.

>
> On the python side, it might be nice to take advantage of static typing.
> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
> good opportunity to jump the python-3-only train.
>
I think we can make types sort of work without ditching 2 (the type hints
would only work in 3, but the code would still function in 2). Ditching 2
entirely would be a big thing to consider; I honestly hadn't been
considering that, but that may just be from spending so much time
maintaining a 2/3 code base. I'd suggest reaching out to user@ before
making that kind of change.
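Concretely, PEP 484 type comments are one way to do that: they are plain
comments at runtime, so the same file keeps working on 2, while checkers like
mypy still see the types. A minimal sketch (the helper function is made up,
just to show the syntax):

    # Sketch of PEP 484 comment-style annotations; valid on both Python 2 and 3.
    # On Python 2 the typing module is available as a small pip-installable backport.
    from typing import List

    from pyspark.sql import DataFrame

    def select_columns(df, names):
        # type: (DataFrame, List[str]) -> DataFrame
        """Checkers read the '# type:' comment; the interpreter ignores it."""
        return df.select(*names)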

>
> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
> wrote:
>
>> Since we're talking about Spark 3.0 in the near future (and since some
>> recent conversation on a proposed change reminded me) I wanted to open up
>> the floor and see if folks have any ideas on how we could make a more
>> Python friendly API for 3.0? I'm planning on taking some time to look at
>> other systems in the solution space and see what we might want to learn
>> from them but I'd love to hear what other folks are thinking too.
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>


Re: Python friendly API for Spark 3.0

2018-09-14 Thread Erik Erlandson
To be clear, is this about "python-friendly API" or "friendly python API" ?

On the python side, it might be nice to take advantage of static typing.
Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
good opportunity to jump the python-3-only train.
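To make the idea concrete, here's a toy, self-contained illustration of what
inline annotations buy (the DataFrame below is a stand-in for illustration,
not pyspark's; function annotations only need Python 3, while things like
variable annotations are 3.6+):

    # Toy example of Python-3 inline annotations; the class is a stand-in,
    # not the real pyspark API.
    from typing import List

    class DataFrame:
        num_partitions: int = 1          # variable annotations are Python 3.6+

        def head(self, n: int = 1) -> List[dict]:
            return [{} for _ in range(n)]

    df = DataFrame()
    print(df.head(2))    # fine
    # df.head("5")       # a checker such as mypy flags this before anything runs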

On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau  wrote:

> Since we're talking about Spark 3.0 in the near future (and since some
> recent conversation on a proposed change reminded me) I wanted to open up
> the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Python friendly API for Spark 3.0

2018-09-14 Thread Holden Karau
Since we're talking about Spark 3.0 in the near future (and since some
recent conversation on a proposed change reminded me) I wanted to open up
the floor and see if folks have any ideas on how we could make a more
Python friendly API for 3.0? I'm planning on taking some time to look at
other systems in the solution space and see what we might want to learn
from them but I'd love to hear what other folks are thinking too.

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


RE: [DISCUSS][CORE] Exposing application status metrics via a source

2018-09-14 Thread Luca Canali
Hi Stavros, All,

Interesting topic; I'll add here some thoughts and personal opinions on it: I too
find the metrics system quite useful for the use case of building Grafana
dashboards, as opposed to scraping logs and/or using the Event Listener
infrastructure, as you mentioned in your mail.
A few additional points in favour of Dropwizard metrics for me are:

-  Regarding the metrics defined on the ExecutorSource, I believe they
have better scalability compared to the standard Task Metrics, as the Dropwizard
metrics go directly from the executors to the sink(s) rather than being routed
through the driver via the ListenerBus.

-  Another advantage that I see is that Dropwizard metrics make it easy
to expose information that is not otherwise available from the EventLog/Listener
events, such as executor.jvmCpuTime (SPARK-25228).

I'd like to add some feedback and random thoughts based on recent work on
SPARK-25228, SPARK-22190, SPARK-25277, and SPARK-25285:

-  The “Dropwizard metrics” space currently appears a bit “crowded”;
we could probably benefit from adding a few configuration parameters to turn
some of the metrics on/off as needed (I see that this point is also raised in
the discussion in your PR 22381).

-  Another point is that the metrics instrumentation is a bit scattered
around the code; it would be nice to have a central place where the available
metrics are listed (maybe just in the documentation).

-  Testing of new metrics seems to be a bit of a manual process at the
moment (at least it was for me), which could be improved. Related to that, I
believe that some recent work on adding new metrics has ended up with a minor
side effect/issue; details are in SPARK-25277.

Best regards,
Luca

From: Stavros Kontopoulos 
Sent: Wednesday, September 12, 2018 22:35
To: Dev 
Subject: [DISCUSS][CORE] Exposing application status metrics via a source

Hi all,

I have a PR, https://github.com/apache/spark/pull/22381, that exposes application
status metrics (related JIRA: SPARK-25394).

So far, metrics tooling needs to scrape the metrics REST API to get metrics like
job delay, stages failed, stages completed, etc.
From a devops perspective it is good to standardize on a unified way of gathering
metrics.
The need came up on the K8s side, where the JMX Prometheus exporter is commonly
used to scrape metrics for several components such as Kafka and Cassandra, but
the need is not limited to that environment.
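For context, the scraping path today looks roughly like this (a rough sketch
against the documented monitoring REST API; host/port, application selection
and error handling are simplified):

    # Rough sketch of scraping the Spark monitoring REST API for status metrics.
    # Assumes the driver UI is reachable on localhost:4040.
    import requests

    BASE = "http://localhost:4040/api/v1"

    app_id = requests.get(BASE + "/applications").json()[0]["id"]
    stages = requests.get(BASE + "/applications/" + app_id + "/stages").json()

    failed = sum(1 for s in stages if s["status"] == "FAILED")
    completed = sum(1 for s in stages if s["status"] == "COMPLETE")
    print("stages: %d failed, %d completed" % (failed, completed))

With a Dropwizard source, the same counters would instead show up directly in
whatever sink (JMX, Graphite, ...) the metrics system is configured with, next
to the other node-level metrics.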

Check the comment here:
"The rest api is great for UI and consolidated analytics, but monitoring 
through it is not as straightforward as when the data emits directly from the 
source like this. There is all kinds of nice context that we get when the data 
from this spark node is collected directly from the node itself, and not 
proxied through another collector / reporter. It is easier to build a 
monitoring data model across the cluster when node, jmx, pod, resource 
manifests, and spark data all align by virtue of coming from the same 
collector. Building a similar view of the cluster just from the rest api, as a 
comparison, is simply harder and quite challenging to do in general purpose 
terms."

The PR is OK to be merged, but the major concern here is the mirroring of the
metrics. I think that mirroring is OK, since people may not want to check the
UI and just want to integrate with JMX only (my use case) and gather
metrics in Grafana (a common case out there).

Do any of the committers or the community have an opinion on this?
Is there agreement on moving forward with this? Note that the addition does
not change much and can always be refactored if we come up with a new plan for 
the metrics story in the future.

Thanks,
Stavros