Personally I'd love this, but I agree with some of the earlier comments that this should not be Python specific (meaning I should be able to implement a data source in Python and then make it usable across all languages Spark supports). I think we should find a way to make this reusable beyond Python (especially for SQL).
Python is the most popular programming language by a large margin, both in general and among Spark users. Many of the organizations that use Spark don't even have a single person who knows Scala. What if they want to implement a custom data source to fetch some data? Today we'd have to tell them to learn Scala/Java and the fairly complex data source API (v1 or v2).

Maciej - I understand your concern about endpoint throttling etc. And the use cases go far beyond querying REST endpoints. I personally had that concern too when we were adding the JDBC data source (what if somebody launches a 512-node Spark cluster to query my single-node MySQL instance?!). But the built-in JDBC data source is one of the most popular data sources (I just looked up its usage on Databricks and it's by far the #1 data source outside of files, used by > 10000 organizations every day).

On Sun, Jun 25, 2023 at 1:38 AM, Maciej < mszymkiew...@gmail.com > wrote:

> Thanks for your feedback Martin.
>
> However, if the primary intended purpose of this API is to provide an
> interface for endpoint querying, then I find this proposal even less
> convincing.
>
> Neither the Spark execution model nor the data source API (full or
> restricted as proposed here) is a good fit for handling problems arising
> from massive endpoint requests, including, but not limited to, handling
> quotas and rate limiting.
>
> Consistency and streamlined development are, of course, valuable.
> Nonetheless, they are not sufficient, especially if they cannot deliver
> the expected user experience in terms of reliability and execution cost.
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 6/24/23 23:42, Martin Grund wrote:
>
>> Hey,
>>
>> I would like to express my strong support for Python Data Sources even
>> though they might not be immediately as powerful as Scala-based data
>> sources.
>> One element that is easily lost in this discussion is how much faster
>> the iteration speed is with Python compared to Scala. Due to the
>> dynamic nature of Python, you can design and build a data source while
>> running in a notebook and continuously change the code until it works
>> the way you want. This behavior is unparalleled!
>>
>> There exists a litany of Python libraries connecting to all kinds of
>> different endpoints that could provide data usable with Spark. I
>> personally can imagine implementing a data source on top of the AWS SDK
>> to extract EC2 instance information. Then I don't have to switch tools
>> and can keep my pipeline consistent.
>>
>> Let's say you want to query an API in parallel from Spark using Python.
>> Today's way would be to create a Python RDD, implement the planning and
>> execution process manually, and finally call `toDF` at the end. While
>> the actual code of the DS and the RDD-based implementation would be
>> very similar, the abstraction provided by the DS is much more powerful
>> and future-proof. Dynamic partition elimination and filter push-down
>> can all be implemented at a later point in time.
>>
>> Comparing a DS to batch calling from a UDF is not a great fit because
>> the execution pattern would be very brittle. Imagine something like
>> `spark.range(10).withColumn("data", fetch_api).explode(col("data")).collect()`.
>> Here you're encoding partitioning logic and data transformation in
>> simple ways, but you can't reason about the structural integrity of the
>> query, and tiny changes in the UDF interface might already cause a lot
>> of downstream issues.
>>
>> Martin
>>
>> On Sat, Jun 24, 2023 at 1:44 AM Maciej < mszymkiewicz@gmail.com > wrote:
>>
>>> With such limited scope (both language availability and features), do
>>> we have any representative examples of sources that could
>>> significantly benefit from providing this API, compared to other
>>> available options, such as batch imports, direct queries from
>>> vectorized UDFs, or even interfacing sources through 3rd-party FDWs?
>>>
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> PGP: A30CEF0C31A501EC
>>>
>>> On 6/20/23 16:23, Wenchen Fan wrote:
>>>
>>>> In an ideal world, every data source you want to connect to already
>>>> has a Spark data source implementation (either v1 or v2), and then
>>>> this Python API is useless. But I feel it's common that people want
>>>> to do quick data exploration, and the target data system is not
>>>> popular enough to have an existing Spark data source implementation.
>>>> It will be useful if people can quickly implement a Spark data source
>>>> using their favorite Python language.
>>>>
>>>> I'm +1 on this proposal, assuming that we will keep it simple and
>>>> won't copy all the complicated features we built in DS v2 into this
>>>> new Python API.
>>>>
>>>> On Tue, Jun 20, 2023 at 2:11 PM Maciej < mszymkiewicz@gmail.com > wrote:
>>>>
>>>>> Similarly to Jacek, I feel it fails to document an actual community
>>>>> need for such a feature.
>>>>>
>>>>> Currently, any data source implementation has the potential to
>>>>> benefit Spark users across all supported and third-party clients.
>>>>> For generally available sources, this is advantageous for the whole
>>>>> Spark community and avoids creating 1st- and 2nd-tier citizens. This
>>>>> is even more important with new officially supported languages being
>>>>> added through Connect.
>>>>> Instead, we might rather document in detail the process of
>>>>> implementing a new source using current APIs and work towards easily
>>>>> extensible or customizable sources, in case there is such a need.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Maciej Szymkiewicz
>>>>>
>>>>> Web: https://zero323.net
>>>>> PGP: A30CEF0C31A501EC
>>>>>
>>>>> On 6/20/23 05:19, Hyukjin Kwon wrote:
>>>>>
>>>>>> Actually, I support this idea in the sense that Python developers
>>>>>> don't have to learn Scala to write their own source (and separate
>>>>>> packaging).
>>>>>> This is even more crucial when you want to write a simple data
>>>>>> source that interacts with the Python ecosystem.
>>>>>>
>>>>>> On Tue, 20 Jun 2023 at 03:08, Denny Lee < denny.g.lee@gmail.com > wrote:
>>>>>>
>>>>>>> Slightly biased, but per my conversations - this would be awesome
>>>>>>> to have!
>>>>>>>
>>>>>>> On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari
>>>>>>> < abdealikothari@gmail.com > wrote:
>>>>>>>
>>>>>>>> I would definitely use it - if it's available :)
>>>>>>>>
>>>>>>>> On Mon, 19 Jun 2023, 21:56 Jacek Laskowski < jacek@japila.pl > wrote:
>>>>>>>>
>>>>>>>>> Hi Allison and devs,
>>>>>>>>>
>>>>>>>>> Although I was against this idea at first sight (probably
>>>>>>>>> because I'm a Scala dev), I think it could work as long as there
>>>>>>>>> are people who'd be interested in such an API. Were there any?
>>>>>>>>> I'm just curious. I've seen no emails requesting it.
>>>>>>>>> I also doubt that Python devs would like to work on new data
>>>>>>>>> sources, but I support their wishes wholeheartedly :)
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Jacek Laskowski
>>>>>>>>> ----
>>>>>>>>> "The Internals Of" Online Books ( https://books.japila.pl/ )
>>>>>>>>> Follow me on https://twitter.com/jaceklaskowski
>>>>>>>>>
>>>>>>>>> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
>>>>>>>>> <allison.wang@databricks.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I would like to start a discussion on "Python Data Source API".
>>>>>>>>>>
>>>>>>>>>> This proposal aims to introduce a simple API in Python for Data
>>>>>>>>>> Sources. The idea is to enable Python developers to create data
>>>>>>>>>> sources without having to learn Scala or deal with the
>>>>>>>>>> complexities of the current data source APIs. The goal is to
>>>>>>>>>> make a Python-based API that is simple and easy to use, thus
>>>>>>>>>> making Spark more accessible to the wider Python developer
>>>>>>>>>> community. This proposed approach is based on the recently
>>>>>>>>>> introduced Python user-defined table functions, with extensions
>>>>>>>>>> to support data sources.
>>>>>>>>>>
>>>>>>>>>> *SPIP Doc*: https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
>>>>>>>>>>
>>>>>>>>>> *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076
>>>>>>>>>>
>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>> Thanks,
>>>>>>>>>> Allison
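[Editor's note] The planning/execution split Martin and the SPIP describe - a source first enumerates partitions, then each partition is read independently (on Spark, in parallel tasks) - can be sketched in plain Python without Spark. All names below (`FakeApiDataSource`, `RangePartition`) are illustrative, not the proposed API:

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RangePartition:
    start: int
    end: int


class FakeApiDataSource:
    """Pretends to read records from a paginated endpoint."""

    def __init__(self, total_rows: int, num_partitions: int) -> None:
        self.total_rows = total_rows
        self.num_partitions = num_partitions

    def partitions(self) -> List[RangePartition]:
        # Planning step: split the row range into independent chunks.
        step = -(-self.total_rows // self.num_partitions)  # ceiling division
        return [
            RangePartition(i, min(i + step, self.total_rows))
            for i in range(0, self.total_rows, step)
        ]

    def read(self, partition: RangePartition) -> Iterator[tuple]:
        # Execution step: each chunk is fetched on its own, so an engine
        # could schedule these reads in parallel across executors.
        for i in range(partition.start, partition.end):
            yield (i, f"record-{i}")


source = FakeApiDataSource(total_rows=10, num_partitions=3)
rows = [row for part in source.partitions() for row in source.read(part)]
```

The point of the abstraction is that the engine, not the user, decides when and where each `read(partition)` call runs, which is what makes later additions like filter push-down possible without touching user pipelines.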