Personally I'd love this, but I agree with some of the earlier comments that this should not be Python specific (meaning I should be able to implement a data source in Python and then make it usable across all languages Spark supports). I think we should find a way to make this reusable beyond Python (especially for SQL).
Python is the most popular programming language by a large margin, both in general and among Spark users. Many of the organizations that use Spark don't even have a single person who knows Scala. What if they want to implement a custom data source to fetch some data? Today we'd have to tell them to learn Scala/Java and the fairly complex data source API (v1 or v2).

Maciej - I understand your concern about endpoint throttling etc. And the use cases go far beyond querying REST endpoints. I personally had that concern too when we were adding the JDBC data source (what if somebody launches a 512-node Spark cluster to query my single-node MySQL instance?!). But the built-in JDBC data source is one of the most popular data sources (I just looked up its usage on Databricks and it's by far the #1 data source outside of files, used by > 10000 organizations every day).

On Sun, Jun 25, 2023 at 1:38 AM, Maciej < mszymkiew...@gmail.com > wrote:

> Thanks for your feedback Martin.
>
> However, if the primary intended purpose of this API is to provide an
> interface for endpoint querying, then I find this proposal even less
> convincing.
>
> Neither the Spark execution model nor the data source API (full or
> restricted as proposed here) is a good fit for handling problems arising
> from massive endpoint requests, including, but not limited to, handling
> quotas and rate limiting.
>
> Consistency and streamlined development are, of course, valuable.
> Nonetheless, they are not sufficient, especially if they cannot deliver
> the expected user experience in terms of reliability and execution cost.
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 6/24/23 23:42, Martin Grund wrote:
>
>> Hey,
>>
>> I would like to express my strong support for Python Data Sources even
>> though they might not be immediately as powerful as Scala-based data
>> sources.
>> One element that is easily lost in this discussion is how much faster
>> the iteration speed is with Python compared to Scala. Due to the
>> dynamic nature of Python, you can design and build a data source while
>> running in a notebook and continuously change the code until it works
>> the way you want. This behavior is unparalleled!
>>
>> There exists a litany of Python libraries connecting to all kinds of
>> different endpoints that could provide data usable with Spark. I
>> personally can imagine implementing a data source on top of the AWS SDK
>> to extract EC2 instance information. Then I don't have to switch tools
>> and can keep my pipeline consistent.
>>
>> Let's say you want to query an API in parallel from Spark using Python.
>> Today's way would be to create a Python RDD, implement the planning and
>> execution process manually, and finally call `toDF` at the end. While
>> the actual code of the DS and the RDD-based implementation would be
>> very similar, the abstraction provided by the DS is much more powerful
>> and future-proof. Dynamic partition elimination and filter push-down
>> can all be implemented at a later point in time.
>>
>> Comparing a DS to batch calling from a UDF is not a great fit because
>> the execution pattern would be very brittle. Imagine something like
>> `spark.range(10).withColumn("data", fetch_api).explode(col("data")).collect()`.
>> Here you're encoding partitioning logic and data transformation in
>> simple ways, but you can't reason about the structural integrity of the
>> query, and tiny changes in the UDF interface might already cause a lot
>> of downstream issues.
>>
>> Martin
>>
>> On Sat, Jun 24, 2023 at 1:44 AM Maciej < mszymkiewicz@gmail.com > wrote:
>>
>>> With such limited scope (both language availability and features), do
>>> we have any representative examples of sources that could
>>> significantly benefit from providing this API, compared to other
>>> available options, such as batch imports, direct queries from
>>> vectorized UDFs, or even interfacing sources through 3rd-party FDWs?
>>>
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> PGP: A30CEF0C31A501EC
>>>
>>> On 6/20/23 16:23, Wenchen Fan wrote:
>>>
>>>> In an ideal world, every data source you want to connect to already
>>>> has a Spark data source implementation (either v1 or v2), and then
>>>> this Python API is useless. But I feel it's common that people want
>>>> to do quick data exploration, and the target data system is not
>>>> popular enough to have an existing Spark data source implementation.
>>>> It will be useful if people can quickly implement a Spark data source
>>>> using their favorite Python language.
>>>>
>>>> I'm +1 on this proposal, assuming that we will keep it simple and
>>>> won't copy all the complicated features we built in DS v2 into this
>>>> new Python API.
>>>>
>>>> On Tue, Jun 20, 2023 at 2:11 PM Maciej < mszymkiewicz@gmail.com > wrote:
>>>>
>>>>> Similarly to Jacek, I feel it fails to document an actual community
>>>>> need for such a feature.
>>>>>
>>>>> Currently, any data source implementation has the potential to
>>>>> benefit Spark users across all supported and third-party clients.
>>>>> For generally available sources, this is advantageous for the whole
>>>>> Spark community and avoids creating 1st- and 2nd-tier citizens. This
>>>>> is even more important with new officially supported languages being
>>>>> added through Connect.
>>>>> Instead, we might rather document in detail the process of
>>>>> implementing a new source using current APIs and work towards easily
>>>>> extensible or customizable sources, in case there is such a need.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Maciej Szymkiewicz
>>>>>
>>>>> Web: https://zero323.net
>>>>> PGP: A30CEF0C31A501EC
>>>>>
>>>>> On 6/20/23 05:19, Hyukjin Kwon wrote:
>>>>>
>>>>>> Actually, I support this idea in the sense that Python developers
>>>>>> don't have to learn Scala to write their own source (and separate
>>>>>> packaging).
>>>>>> This is even more crucial when you want to write a simple data
>>>>>> source that interacts with the Python ecosystem.
>>>>>>
>>>>>> On Tue, 20 Jun 2023 at 03:08, Denny Lee < denny.g.lee@gmail.com > wrote:
>>>>>>
>>>>>>> Slightly biased, but per my conversations - this would be awesome
>>>>>>> to have!
>>>>>>>
>>>>>>> On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari
>>>>>>> < abdealikothari@gmail.com > wrote:
>>>>>>>
>>>>>>>> I would definitely use it - if it's available :)
>>>>>>>>
>>>>>>>> On Mon, 19 Jun 2023, 21:56 Jacek Laskowski < jacek@japila.pl > wrote:
>>>>>>>>
>>>>>>>>> Hi Allison and devs,
>>>>>>>>>
>>>>>>>>> Although I was against this idea at first sight (probably
>>>>>>>>> because I'm a Scala dev), I think it could work as long as there
>>>>>>>>> are people who'd be interested in such an API. Were there any?
>>>>>>>>> I'm just curious. I've seen no emails requesting it.
>>>>>>>>> I also doubt that Python devs would like to work on new data
>>>>>>>>> sources, but I support their wishes wholeheartedly :)
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Jacek Laskowski
>>>>>>>>> ----
>>>>>>>>> "The Internals Of" Online Books ( https://books.japila.pl/ )
>>>>>>>>> Follow me on https://twitter.com/jaceklaskowski
>>>>>>>>>
>>>>>>>>> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
>>>>>>>>> <allison.wang@databricks.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I would like to start a discussion on "Python Data Source API".
>>>>>>>>>>
>>>>>>>>>> This proposal aims to introduce a simple API in Python for Data
>>>>>>>>>> Sources. The idea is to enable Python developers to create data
>>>>>>>>>> sources without having to learn Scala or deal with the
>>>>>>>>>> complexities of the current data source APIs. The goal is to
>>>>>>>>>> make a Python-based API that is simple and easy to use, thus
>>>>>>>>>> making Spark more accessible to the wider Python developer
>>>>>>>>>> community. This proposed approach is based on the recently
>>>>>>>>>> introduced Python user-defined table functions, with extensions
>>>>>>>>>> to support data sources.
>>>>>>>>>>
>>>>>>>>>> *SPIP Doc*: https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
>>>>>>>>>>
>>>>>>>>>> *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076
>>>>>>>>>>
>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>> Thanks,
>>>>>>>>>> Allison
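[Editor's note] The planning/execution split Martin and the SPIP describe - a source first enumerates partitions, then each partition is read independently (on Spark, in parallel tasks) - can be sketched in plain Python without Spark. All names below (`FakeApiDataSource`, `RangePartition`) are illustrative, not the proposed API:

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RangePartition:
    start: int
    end: int


class FakeApiDataSource:
    """Pretends to read records from a paginated endpoint."""

    def __init__(self, total_rows: int, num_partitions: int) -> None:
        self.total_rows = total_rows
        self.num_partitions = num_partitions

    def partitions(self) -> List[RangePartition]:
        # Planning step: split the row range into independent chunks.
        step = -(-self.total_rows // self.num_partitions)  # ceiling division
        return [
            RangePartition(i, min(i + step, self.total_rows))
            for i in range(0, self.total_rows, step)
        ]

    def read(self, partition: RangePartition) -> Iterator[tuple]:
        # Execution step: each chunk is fetched on its own, so an engine
        # could schedule these reads in parallel across executors.
        for i in range(partition.start, partition.end):
            yield (i, f"record-{i}")


source = FakeApiDataSource(total_rows=10, num_partitions=3)
rows = [row for part in source.partitions() for row in source.read(part)]
```

The point of the abstraction is that the engine, not the user, decides when and where each `read(partition)` call runs, which is what makes later additions like filter push-down possible without touching user pipelines.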