[Discuss] Datasource v2 support for Kerberos

2018-09-15 Thread tigerquoll
The current V2 Datasource API provides support for querying a portion of the
SparkConf namespace (spark.datasource.*) via the SessionConfigSupport API.
This was designed on the assumption that the configuration of each v2 data
source should be separate from that of every other.

Unfortunately, there are some cross-cutting concerns, such as
authentication, that touch multiple data sources, which means that common
configuration items need to be shared amongst them.
In particular, Kerberos setup can use the following configuration items:

* userPrincipal
* userKeytabPath
* krb5ConfPath
* kerberos debugging flags
* spark.security.credentials.${service}.enabled
* JAAS config
* ZKServerPrincipal (?)

The potential solutions I can think of for passing this information to the
various data sources are:

* Pass the entire SparkContext object to data sources (not likely)
* Pass the entire SparkConf, as a Map object, to data sources
* Pass all required configuration via environment variables
* Extend SessionConfigSupport to support passing specific white-listed
configuration values
* Add a specific data source v2 API "SupportsKerberos" so that a data source
can indicate that it supports Kerberos and also provide the means to pass the
needed configuration info (a sketch of this option follows the TLS note
below).
* Duplicate all Kerberos configuration items into the config namespace of
each data source that needs them.

If a data source requires TLS support, then we also need to support passing
all the configuration values under "spark.ssl.*".
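To make the "SupportsKerberos" option concrete, here is a minimal sketch of
the shape such a mixin could take, modelled on the existing
SessionConfigSupport mixin. The trait and the KerberosConfig fields are
hypothetical, invented here for illustration; nothing like this exists in
Spark today:

    import org.apache.spark.sql.sources.v2.DataSourceV2  // v2 package as of Spark 2.3/2.4

    // Hypothetical carrier for the cross-cutting Kerberos settings listed above.
    case class KerberosConfig(
        userPrincipal: String,
        userKeytabPath: String,
        krb5ConfPath: String,
        debug: Boolean = false,
        jaasConfig: Option[String] = None)

    // Hypothetical mixin: a source implements this to signal Kerberos support,
    // and Spark calls back with settings resolved from the session config.
    trait SupportsKerberos { self: DataSourceV2 =>
      def setKerberosConfig(config: KerberosConfig): Unit
    }

TLS could be handled the same way, e.g. an analogous SupportsTls mixin
carrying the resolved "spark.ssl.*" values.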

What do people think? A placeholder issue has been filed as SPARK-25329.






[Discuss] Datasource v2 support for manipulating partitions

2018-09-15 Thread tigerquoll
I've been following the development of the new data source abstraction with
keen interest. One of the issues that occurred to me as I sat down and
planned how I would implement a data source is how I would support
manipulating partitions.

My reading of the current prototype is that the data source v2 APIs expose
enough of a concept of a partition to support communicating record
distribution particulars to Catalyst, but they do not represent partitions as
a concept that the end user of a data source can manipulate.

The end users of data sources need to be able to add/drop/modify and list
partitions. For example, many systems require partitions to be created
before records are added to them.  

For batch use-cases, it may be possible for users to manipulate partitions
from within the system that the data source interfaces with, but for
streaming use-cases this is not at all practical.

Two ways I can think of doing this are:
1. Allow "pass-through" commands to the underlying data source
2. Have a generic concept of partitions exposed to the end user via the data
source API and Spark SQL DML.

I'm keen on option 2, but I recognise that it's possible there are better
alternatives out there. (A sketch of option 2 follows.)
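Here is a rough sketch of a mixin in the style of the other data source v2
interfaces. The trait name and method set are hypothetical, invented for this
discussion; no such interface exists in the current prototype:

    import org.apache.spark.sql.sources.v2.DataSourceV2  // v2 package as of Spark 2.3/2.4

    // Hypothetical sketch: a source that can manage partitions natively mixes
    // this in, and Spark SQL DML such as
    //   ALTER TABLE t ADD PARTITION (dt = '2018-09-15')
    // could then be routed through it, for batch and streaming alike.
    trait SupportsPartitionManagement { self: DataSourceV2 =>
      // A partition spec maps partition column name -> value,
      // e.g. Map("dt" -> "2018-09-15").
      type PartitionSpec = Map[String, String]

      def listPartitions(): Seq[PartitionSpec]
      def createPartition(spec: PartitionSpec): Unit
      def dropPartition(spec: PartitionSpec): Unit
    }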






RE: Support STS to run in k8s deployment with spark deployment mode as cluster

2018-09-15 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi,
The following bug tracks this:

https://issues.apache.org/jira/browse/SPARK-25442

Regards
Surya

From: Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Sent: Sunday, September 16, 2018 10:15 AM
To: dev@spark.apache.org; Ilan Filonenko
Cc: u...@spark.apache.org; Imandi, Srinivas (Nokia - IN/Bangalore);
Chakradhar, N R (Nokia - IN/Bangalore); Rao, Abhishek (Nokia - IN/Bangalore)
Subject: Support STS to run in k8s deployment with spark deployment mode as
cluster

Hi All,
I would like to propose the following changes to support running the Spark
Thrift Server (STS) in k8s deployments with the Spark deploy mode set to
cluster.

PR: https://github.com/apache/spark/pull/22433

Can you please review and provide comments?


Regards
Surya



Re: from_csv

2018-09-15 Thread Reynold Xin
makes sense - i'd make this as consistent with to_json / from_json as
possible.

how would this work in sql? i.e. how would passing options in work?
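(For reference, from_json already accepts options in SQL as a map literal in
the third argument, so from_csv could mirror that shape. A minimal sketch
follows, assuming an active SparkSession named spark; the from_csv line is
hypothetical syntax for discussion, not necessarily what the PR implements.)

    // Existing pattern: from_json takes a DDL schema string and an options
    // map literal in SQL.
    spark.sql(
      """SELECT from_json('{"time": "26/08/2018"}', 'time Timestamp',
        |                 map('timestampFormat', 'dd/MM/yyyy'))""".stripMargin
    ).show()

    // A from_csv analogue could follow the same shape (hypothetical):
    //   SELECT from_csv('26/08/2018', 'time Timestamp',
    //                   map('timestampFormat', 'dd/MM/yyyy'))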

--
excuse the brevity and lower case due to wrist injury


On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk wrote:

> Hi All,
>
> I would like to propose a new function, from_csv(), for parsing columns
> containing strings in CSV format. Here is my PR:
> https://github.com/apache/spark/pull/22379
>
> A use case is loading a dataset from external storage, a DBMS, or a system
> like Kafka, where CSV content was dumped as one of the columns/fields. Other
> columns could contain related information such as timestamps, ids, sources
> of the data, etc. The column with CSV strings can be parsed by the existing
> csv() method of DataFrameReader, but in that case we have to "clean up" the
> dataset and remove the other columns, since the csv() method requires a
> Dataset[String]. Joining the result of parsing back to the original dataset
> by position is expensive and not convenient. Instead, users parse CSV
> columns with string functions. That approach is usually error prone,
> especially for quoted values and other special cases.
>
> The methods proposed in the PR should make for a better user experience
> when parsing CSV-like columns. Please share your thoughts.
>
> --
>
> Maxim Gekk
>
> Technical Solutions Lead
>
> Databricks Inc.
>
> maxim.g...@databricks.com
>
> databricks.com
>
>   
>


Support STS to run in k8s deployment with spark deployment mode as cluster

2018-09-15 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi All,
I would like to propose the following changes to support running the Spark
Thrift Server (STS) in k8s deployments with the Spark deploy mode set to
cluster.

PR: https://github.com/apache/spark/pull/22433

Can you please review and provide comments?


Regards
Surya



Re: Python friendly API for Spark 3.0

2018-09-15 Thread Jules Damji
+1
I think phasing out any feature or supported language at its EOL is a better
strategy, where possible, than a quick drop. With enough advance warning, it
can gradually be dropped in 3.x; of course, there are exceptions.

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Sep 15, 2018, at 10:49 AM, Reynold Xin  wrote:
> 
> we can also declare python 2 as deprecated and drop it in 3.x, not 
> necessarily 3.0.
> 
> --
> excuse the brevity and lower case due to wrist injury
> 
> 
>> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson  wrote:
>> I am probably splitting hairs too finely, but I was considering the 
>> difference between improvements to the jvm-side (py4j and the scala/java 
>> code) that would make it easier to write the python layer ("python-friendly 
>> api"), and actual improvements to the python layers ("friendly python api").
>> 
>> They're not mutually exclusive of course, and both worth working on. But 
>> it's *possible* to improve either without the other.
>> 
>> Stub files look like a great solution for type annotations, maybe even if 
>> only python 3 is supported.
>> 
>> I definitely agree that any decision to drop python 2 should not be taken 
>> lightly. Anecdotally, I'm seeing an increase in python developers announcing 
>> that they are dropping support for python 2 (and loving it). As people have 
>> already pointed out, if we don't drop python 2 for spark 3.0, we're stuck 
>> with it until 4.0, which would place spark in a possibly-awkward position of 
>> supporting python 2 for some time after it goes EOL.
>> 
>> Under the current release cadence, spark 3.0 will land some time in early 
>> 2019, at which point it will be mere months until EOL for py2.
>> 
>>> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau  wrote:
>>> 
>>> 
 On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
 To be clear, is this about "python-friendly API" or "friendly python API" ?
>>> 
>>> Well what would you consider to be different between those two statements? 
>>> I think it would be good to be a bit more explicit, but I don't think we 
>>> should necessarily limit ourselves.
 
 On the python side, it might be nice to take advantage of static typing. 
 Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a 
 good opportunity to jump the python-3-only train.
>>> 
>>> I think we can make types sort of work without ditching 2 (the types only 
>>> would work in 3 but it would still function in 2). Ditching 2 entirely 
>>> would be a big thing to consider, I honestly hadn't been considering that 
>>> but it could be from just spending so much time maintaining a 2/3 code 
>>> base. I'd suggest reaching out to user@ before making that kind of 
>>> change.
 
> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau  
> wrote:
> Since we're talking about Spark 3.0 in the near future (and since some 
> recent conversation on a proposed change reminded me) I wanted to open up 
> the floor and see if folks have any ideas on how we could make a more 
> Python friendly API for 3.0? I'm planning on taking some time to look at 
> other systems in the solution space and see what we might want to learn 
> from them but I'd love to hear what other folks are thinking too.
> 
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): 
> https://amzn.to/2MaRAG9 
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
 
>> 


Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
In case this didn't make it onto this thread:

There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove
it entirely on a later 3.x release.

On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson wrote:

> On a separate dev@spark thread, I raised a question of whether or not to
> support python 2 in Apache Spark, going forward into Spark 3.0.
>
> Python-2 is going EOL at
> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
> make breaking changes to Spark's APIs, and so it is a good time to
> reconsider Python-2 support in PySpark.
>
> Key advantages to dropping Python 2 are:
>
>- Support for PySpark becomes significantly easier.
>- Avoid having to support Python 2 until Spark 4.0, which is likely to
>imply supporting Python 2 for some time after it goes EOL.
>
> (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
>
> The main disadvantage is that PySpark users who have legacy python-2 code
> would have to migrate their code to python 3 to take advantage of Spark 3.0.
>
> This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
>
>


Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Nicholas Chammas
As Reynold pointed out, we don't have to drop Python 2 support right off
the bat. We can just deprecate it with Spark 3.0, which would allow us to
actually drop it at a later 3.x release.

On Sat, Sep 15, 2018 at 2:09 PM Erik Erlandson  wrote:

> On a separate dev@spark thread, I raised a question of whether or not to
> support python 2 in Apache Spark, going forward into Spark 3.0.
>
> Python-2 is going EOL at
> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
> make breaking changes to Spark's APIs, and so it is a good time to
> reconsider Python-2 support in PySpark.
>
> Key advantages to dropping Python 2 are:
>
>- Support for PySpark becomes significantly easier.
>- Avoid having to support Python 2 until Spark 4.0, which is likely to
>imply supporting Python 2 for some time after it goes EOL.
>
> (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
>
> The main disadvantage is that PySpark users who have legacy python-2 code
> would have to migrate their code to python 3 to take advantage of Spark 3.0.
>
> This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
>
>


Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
On a separate dev@spark thread, I raised a question of whether or not to
support python 2 in Apache Spark, going forward into Spark 3.0.

Python-2 is going EOL at the
end of 2019. The upcoming release of Spark 3.0 is an opportunity to make
breaking changes to Spark's APIs, and so it is a good time to reconsider
Python-2 support in PySpark.

Key advantages to dropping Python 2 are:

   - Support for PySpark becomes significantly easier.
   - Avoid having to support Python 2 until Spark 4.0, which is likely to
   imply supporting Python 2 for some time after it goes EOL.

(Note that supporting python 2 after EOL means, among other things, that
PySpark would be supporting a version of python that was no longer
receiving security patches)

The main disadvantage is that PySpark users who have legacy python-2 code
would have to migrate their code to python 3 to take advantage of Spark 3.0.

This decision obviously has large implications for the Apache Spark
community and we want to solicit community feedback.


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Reynold Xin
we can also declare python 2 as deprecated and drop it in 3.x, not
necessarily 3.0.

--
excuse the brevity and lower case due to wrist injury


On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson  wrote:

> I am probably splitting hairs too finely, but I was considering the
> difference between improvements to the jvm-side (py4j and the scala/java
> code) that would make it easier to write the python layer ("python-friendly
> api"), and actual improvements to the python layers ("friendly python api").
>
> They're not mutually exclusive of course, and both worth working on. But
> it's *possible* to improve either without the other.
>
> Stub files look like a great solution for type annotations, maybe even if
> only python 3 is supported.
>
> I definitely agree that any decision to drop python 2 should not be taken
> lightly. Anecdotally, I'm seeing an increase in python developers
> announcing that they are dropping support for python 2 (and loving it). As
> people have already pointed out, if we don't drop python 2 for spark 3.0,
> we're stuck with it until 4.0, which would place spark in a
> possibly-awkward position of supporting python 2 for some time after it
> goes EOL.
>
> Under the current release cadence, spark 3.0 will land some time in early
> 2019, at which point it will be mere months until EOL for py2.
>
> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau 
> wrote:
>
>>
>>
>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>>
>>> To be clear, is this about "python-friendly API" or "friendly python
>>> API" ?
>>>
>> Well what would you consider to be different between those two
>> statements? I think it would be good to be a bit more explicit, but I don't
>> think we should necessarily limit ourselves.
>>
>>>
>>> On the python side, it might be nice to take advantage of static typing.
>>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>>> good opportunity to jump the python-3-only train.
>>>
>> I think we can make types sort of work without ditching 2 (the types only
>> would work in 3 but it would still function in 2). Ditching 2 entirely
>> would be a big thing to consider, I honestly hadn't been considering that
>> but it could be from just spending so much time maintaining a 2/3 code
>> base. I'd suggest reaching out to user@ before making that kind of
>> change.
>>
>>>
>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>>> wrote:
>>>
 Since we're talking about Spark 3.0 in the near future (and since some
 recent conversation on a proposed change reminded me) I wanted to open up
 the floor and see if folks have any ideas on how we could make a more
 Python friendly API for 3.0? I'm planning on taking some time to look at
 other systems in the solution space and see what we might want to learn
 from them but I'd love to hear what other folks are thinking too.

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Erik Erlandson
I am probably splitting hairs too finely, but I was considering the
difference between improvements to the jvm-side (py4j and the scala/java
code) that would make it easier to write the python layer ("python-friendly
api"), and actual improvements to the python layers ("friendly python api").

They're not mutually exclusive of course, and both worth working on. But
it's *possible* to improve either without the other.

Stub files look like a great solution for type annotations, maybe even if
only python 3 is supported.

I definitely agree that any decision to drop python 2 should not be taken
lightly. Anecdotally, I'm seeing an increase in python developers
announcing that they are dropping support for python 2 (and loving it). As
people have already pointed out, if we don't drop python 2 for spark 3.0,
we're stuck with it until 4.0, which would place spark in a
possibly-awkward position of supporting python 2 for some time after it
goes EOL.

Under the current release cadence, spark 3.0 will land some time in early
2019, at which point it will be mere months until EOL for py2.

On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau  wrote:

>
>
> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>
>> To be clear, is this about "python-friendly API" or "friendly python API"
>> ?
>>
> Well what would you consider to be different between those two statements?
> I think it would be good to be a bit more explicit, but I don't think we
> should necessarily limit ourselves.
>
>>
>> On the python side, it might be nice to take advantage of static typing.
>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>> good opportunity to jump the python-3-only train.
>>
> I think we can make types sort of work without ditching 2 (the types only
> would work in 3 but it would still function in 2). Ditching 2 entirely
> would be a big thing to consider, I honestly hadn't been considering that
> but it could be from just spending so much time maintaining a 2/3 code
> base. I'd suggest reaching out to user@ before making that kind of
> change.
>
>>
>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>> wrote:
>>
>>> Since we're talking about Spark 3.0 in the near future (and since some
>>> recent conversation on a proposed change reminded me) I wanted to open up
>>> the floor and see if folks have any ideas on how we could make a more
>>> Python friendly API for 3.0? I'm planning on taking some time to look at
>>> other systems in the solution space and see what we might want to learn
>>> from them but I'd love to hear what other folks are thinking too.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Leif Walsh
Hey there,

Here’s something I proposed recently that’s in this space.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-24258

It’s motivated by working with a user who wanted to do some custom
statistics: they could write the numpy code and knew along which dimensions
they could parallelize it, but when it came to actually getting it running,
the type system really got in the way.


On Fri, Sep 14, 2018 at 15:15 Holden Karau  wrote:

> Since we're talking about Spark 3.0 in the near future (and since some
> recent conversation on a proposed change reminded me) I wanted to open up
> the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
--
Cheers,
Leif


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
For reference, I raised the question of Python 2 support before:
http://apache-spark-developers-list.1001551.n3.nabble.com/Future-of-the-Python-2-support-td20094.html



On Sat, 15 Sep 2018 at 15:14, Alexander Shorin  wrote:

> When is the release due for Apache Spark 3.0? Will it be tomorrow or
> somewhere in the middle of 2019?
>
> I think we shouldn't care much about Python 2.x today, since quite
> soon its support turns into a pumpkin. For today's projects I hope nobody
> takes into account support of 2.7 unless there is some legacy still to
> carry on, but do we want to take that baggage into the Apache Spark 3.x
> era? The next time you could drop it would be the 4.0 release, because
> of the breaking change.
>
> --
> ,,,^..^,,,
> On Sat, Sep 15, 2018 at 2:21 PM Maciej Szymkiewicz
>  wrote:
> >
> > There is no need to ditch Python 2. There are basically two options:
> >
> > Use stub files and limit type-hint support to Python 3 only. Python 3
> > users benefit from type hints, Python 2 users don't, but no core
> > functionality is affected. This is the approach I've used with
> > https://github.com/zero323/pyspark-stubs/.
> > Use comment-based inline syntax or stub files and don't use backward
> > incompatible features (primarily the typing module -
> > https://docs.python.org/3/library/typing.html). Both Python 2 and 3 are
> > supported, but more advanced components are not. A small win for Python 2
> > users, a moderate loss for Python 3 users.
> >
> >
> >
> > On Sat, 15 Sep 2018 at 02:38, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >>
> >> Do we need to ditch Python 2 support to provide type hints? I don’t
> think so.
> >>
> >> Python lets you specify typing stubs that provide the same benefit
> without forcing Python 3.
> >>
> >> On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote:
> >>>
> >>>
> >>>
> >>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson 
> wrote:
> 
>  To be clear, is this about "python-friendly API" or "friendly python
> API" ?
> >>>
> >>> Well what would you consider to be different between those two
> statements? I think it would be good to be a bit more explicit, but I don't
> think we should necessarily limit ourselves.
> 
> 
>  On the python side, it might be nice to take advantage of static
> typing. Requires python 3.6 but with python 2 going EOL, a spark-3.0 might
> be a good opportunity to jump the python-3-only train.
> >>>
> >>> I think we can make types sort of work without ditching 2 (the types
> only would work in 3 but it would still function in 2). Ditching 2 entirely
> would be a big thing to consider, I honestly hadn't been considering that
> but it could be from just spending so much time maintaining a 2/3 code
> base. I'd suggest reaching out to user@ before making that kind of
> change.
> 
> 
>  On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
> wrote:
> >
> > Since we're talking about Spark 3.0 in the near future (and since
> some recent conversation on a proposed change reminded me) I wanted to open
> up the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> 
> 
> >
> >
>


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Alexander Shorin
When is the release due for Apache Spark 3.0? Will it be tomorrow or
somewhere in the middle of 2019?

I think we shouldn't care much about Python 2.x today, since quite
soon its support turns into a pumpkin. For today's projects I hope nobody
takes into account support of 2.7 unless there is some legacy still to
carry on, but do we want to take that baggage into the Apache Spark 3.x
era? The next time you could drop it would be the 4.0 release, because
of the breaking change.

--
,,,^..^,,,
On Sat, Sep 15, 2018 at 2:21 PM Maciej Szymkiewicz wrote:
>
> There is no need to ditch Python 2. There are basically two options:
>
> Use stub files and limit type-hint support to Python 3 only. Python 3
> users benefit from type hints, Python 2 users don't, but no core
> functionality is affected. This is the approach I've used with
> https://github.com/zero323/pyspark-stubs/.
> Use comment-based inline syntax or stub files and don't use backward
> incompatible features (primarily the typing module -
> https://docs.python.org/3/library/typing.html). Both Python 2 and 3 are
> supported, but more advanced components are not. A small win for Python 2
> users, a moderate loss for Python 3 users.
>
>
>
> On Sat, 15 Sep 2018 at 02:38, Nicholas Chammas  
> wrote:
>>
>> Do we need to ditch Python 2 support to provide type hints? I don’t think so.
>>
>> Python lets you specify typing stubs that provide the same benefit without 
>> forcing Python 3.
>>
>> On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote:
>>>
>>>
>>>
>>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:

 To be clear, is this about "python-friendly API" or "friendly python API" ?
>>>
>>> Well what would you consider to be different between those two statements? 
>>> I think it would be good to be a bit more explicit, but I don't think we 
>>> should necessarily limit ourselves.


 On the python side, it might be nice to take advantage of static typing. 
 Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a 
 good opportunity to jump the python-3-only train.
>>>
>>> I think we can make types sort of work without ditching 2 (the types only 
>>> would work in 3 but it would still function in 2). Ditching 2 entirely 
>>> would be a big thing to consider, I honestly hadn't been considering that 
>>> but it could be from just spending so much time maintaining a 2/3 code 
>>> base. I'd suggest reaching out to user@ before making that kind of 
>>> change.


 On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau  
 wrote:
>
> Since we're talking about Spark 3.0 in the near future (and since some 
> recent conversation on a proposed change reminded me) I wanted to open up 
> the floor and see if folks have any ideas on how we could make a more 
> Python friendly API for 3.0? I'm planning on taking some time to look at 
> other systems in the solution space and see what we might want to learn 
> from them but I'd love to hear what other folks are thinking too.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): 
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau


>
>




Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
There is no need to ditch Python 2. There are basically two options:

   - Use stub files and limit type-hint support to Python 3 only.
   Python 3 users benefit from type hints, Python 2 users don't, but no core
   functionality is affected. This is the approach I've used with
   https://github.com/zero323/pyspark-stubs/.
   - Use comment-based inline syntax or stub files and don't use backward
   incompatible features (primarily the typing module -
   https://docs.python.org/3/library/typing.html). Both Python 2 and 3 are
   supported, but more advanced components are not. A small win for Python 2
   users, a moderate loss for Python 3 users.



On Sat, 15 Sep 2018 at 02:38, Nicholas Chammas wrote:

> Do we need to ditch Python 2 support to provide type hints? I don’t think
> so.
>
> Python lets you specify typing stubs that provide the same benefit without
> forcing Python 3.
>
> On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote:
>
>>
>>
>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>>
>>> To be clear, is this about "python-friendly API" or "friendly python
>>> API" ?
>>>
>> Well what would you consider to be different between those two
>> statements? I think it would be good to be a bit more explicit, but I don't
>> think we should necessarily limit ourselves.
>>
>>>
>>> On the python side, it might be nice to take advantage of static typing.
>>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>>> good opportunity to jump the python-3-only train.
>>>
>> I think we can make types sort of work without ditching 2 (the types only
>> would work in 3 but it would still function in 2). Ditching 2 entirely
>> would be a big thing to consider, I honestly hadn't been considering that
>> but it could be from just spending so much time maintaining a 2/3 code
>> base. I'd suggest reaching out to user@ before making that kind of
>> change.
>>
>>>
>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>>> wrote:
>>>
 Since we're talking about Spark 3.0 in the near future (and since some
 recent conversation on a proposed change reminded me) I wanted to open up
 the floor and see if folks have any ideas on how we could make a more
 Python friendly API for 3.0? I'm planning on taking some time to look at
 other systems in the solution space and see what we might want to learn
 from them but I'd love to hear what other folks are thinking too.

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>


from_csv

2018-09-15 Thread Maxim Gekk
Hi All,

I would like to propose a new function, from_csv(), for parsing columns
containing strings in CSV format. Here is my PR:
https://github.com/apache/spark/pull/22379

A use case is loading a dataset from external storage, a DBMS, or a system
like Kafka, where CSV content was dumped as one of the columns/fields. Other
columns could contain related information such as timestamps, ids, sources of
the data, etc. The column with CSV strings can be parsed by the existing
csv() method of DataFrameReader, but in that case we have to "clean up" the
dataset and remove the other columns, since the csv() method requires a
Dataset[String]. Joining the result of parsing back to the original dataset
by position is expensive and not convenient. Instead, users parse CSV columns
with string functions. That approach is usually error prone, especially for
quoted values and other special cases.

The methods proposed in the PR should make for a better user experience when
parsing CSV-like columns. Please share your thoughts.
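To make the Kafka-style use case concrete, here is a sketch of how the
proposed function might be used from Scala, assuming a from_csv(column,
schema, options) signature shaped like the existing from_json. The column
names and schema below are invented for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // e.g. a Kafka dump: a CSV payload plus related columns we want to keep.
    val df = Seq(("key1", 1536969600000L, "1,2,2018-09-15"))
      .toDF("key", "timestamp", "value")

    val schema = new StructType()
      .add("a", IntegerType)
      .add("b", IntegerType)
      .add("dt", StringType)

    // With DataFrameReader.csv() we would first have to drop "key" and
    // "timestamp"; a from_csv column function keeps them alongside the
    // parsed struct (hypothetical call, pending the PR):
    // df.select($"key", $"timestamp", from_csv($"value", schema, Map.empty[String, String]))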

-- 

Maxim Gekk

Technical Solutions Lead

Databricks Inc.

maxim.g...@databricks.com

databricks.com