Thanks for your feedback, Martin.

However, if the primary intended purpose of this API is to provide an interface for endpoint querying, then I find this proposal even less convincing.

Neither the Spark execution model nor the data source API (full or restricted as proposed here) is a good fit for the problems that arise from issuing endpoint requests at scale, including, but not limited to, quota handling and rate limiting.
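
To make that concrete: any throttling has to live inside each task, so with N concurrent tasks an endpoint still sees up to N times the per-task budget, and nothing in the execution model provides a shared token bucket, retry budget, or backpressure. A minimal sketch (all names are illustrative):

    import time

    MAX_CALLS_PER_SEC = 5  # hypothetical per-endpoint quota

    def call_endpoint(key):
        # stand-in for a real HTTP request
        return {"key": key, "value": key * 2}

    def fetch_partition(rows):
        # Each Spark task runs this loop independently; the sleep caps
        # one task at MAX_CALLS_PER_SEC, but N concurrent tasks still
        # hit the endpoint N times as fast, with no global coordination.
        for row in rows:
            time.sleep(1.0 / MAX_CALLS_PER_SEC)  # local throttle only
            yield call_endpoint(row["id"])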

Consistency and streamlined development are, of course, valuable. Nonetheless, they are not sufficient, especially if they cannot deliver the expected user experience in terms of reliability and execution cost.

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 6/24/23 23:42, Martin Grund wrote:
Hey,

I would like to express my strong support for Python Data Sources even though they might not immediately be as powerful as Scala-based data sources. One element that is easily lost in this discussion is how much faster the iteration speed is with Python compared to Scala. Due to the dynamic nature of Python, you can design and build a data source while running in a notebook and continuously change the code until it works as you want. This behavior is unparalleled!

There is a wealth of Python libraries connecting to all kinds of different endpoints that could provide data usable with Spark. I personally can imagine implementing a data source on top of the AWS SDK to extract EC2 instance information. That way I don't have to switch tools and can keep my pipeline consistent.
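
Today the workaround would be a driver-side fetch followed by `createDataFrame`, roughly as sketched below (assuming boto3 with configured credentials; pagination and error handling omitted):

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    ec2 = boto3.client("ec2")

    # Driver-side fetch: no parallelism, no push-down, and the whole
    # result sits in driver memory before Spark ever sees it.
    rows = [
        {
            "instance_id": i["InstanceId"],
            "instance_type": i["InstanceType"],
            "state": i["State"]["Name"],
        }
        for r in ec2.describe_instances()["Reservations"]
        for i in r["Instances"]
    ]
    df = spark.createDataFrame(rows)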

Let's say you want to query an API in parallel from Spark using Python. Today's way would be to create a Python RDD and implement the planning and execution process manually, finally calling `toDF` at the end. While the actual code of the DS- and the RDD-based implementations would be very similar, the abstraction provided by the DS is much more powerful and future-proof: dynamic partition elimination and filter push-down can all be implemented at a later point in time.
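
For reference, the RDD-based pattern looks roughly like this (the endpoint URL and response shape are made up, and retries and throttling are omitted):

    import requests
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def fetch_page(page):
        # hypothetical paginated endpoint
        resp = requests.get("https://api.example.com/items",
                            params={"page": page}, timeout=10)
        resp.raise_for_status()
        return [Row(**item) for item in resp.json()["items"]]

    # Manual "planning": decide up front that there are 100 pages, one
    # partition each -- exactly the step a DS abstraction would own.
    pages = sc.parallelize(range(100), numSlices=100)
    df = pages.flatMap(fetch_page).toDF()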

Comparing a DS to batch calling from a UDF is not great because the execution pattern would be very brittle. Imagine something like `spark.range(10).withColumn("data", fetch_api("id")).select(explode("data")).collect()`. Here you're encoding partitioning logic and data transformation in simple ways, but you can't reason about the structural integrity of the query, and tiny changes in the UDF interface can already cause a lot of downstream issues.
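
Spelled out, that pattern would be something like the following (`fetch_api` is a hypothetical UDF; the declared result schema is the brittle part, since it has to be kept in sync with the endpoint by hand):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, udf
    from pyspark.sql.types import (ArrayType, StringType, StructField,
                                   StructType)

    spark = SparkSession.builder.getOrCreate()

    item = StructType([StructField("id", StringType()),
                       StructField("value", StringType())])

    @udf(returnType=ArrayType(item))
    def fetch_api(page):
        # stand-in for a real HTTP call, one "page" per input row
        return [{"id": f"{page}-{i}", "value": "..."} for i in range(3)]

    rows = (spark.range(10)
            .withColumn("data", fetch_api(col("id")))
            .select(explode(col("data")).alias("record"))
            .collect())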


Martin


On Sat, Jun 24, 2023 at 1:44 AM Maciej <mszymkiew...@gmail.com> wrote:

    With such a limited scope (in both language availability and
    features), do we have any representative examples of sources that
    would significantly benefit from this API, compared to other
    available options such as batch imports, direct queries from
    vectorized UDFs (see the sketch below), or interfacing with
    sources through 3rd-party FDWs?
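
    For instance, a vectorized UDF querying an endpoint once per batch
    of keys could be as simple as this sketch (the URL and payload
    shape are made up):

        import pandas as pd
        import requests
        from pyspark.sql.functions import pandas_udf

        @pandas_udf("string")
        def lookup(keys: pd.Series) -> pd.Series:
            # hypothetical batch endpoint: one request per Arrow batch
            resp = requests.post("https://api.example.com/batch",
                                 json={"keys": keys.tolist()},
                                 timeout=10)
            resp.raise_for_status()
            return pd.Series(resp.json()["values"])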

    Best regards,
    Maciej Szymkiewicz

    Web: https://zero323.net
    PGP: A30CEF0C31A501EC

    On 6/20/23 16:23, Wenchen Fan wrote:
    In an ideal world, every data source you want to connect to
    already has a Spark data source implementation (either v1 or v2),
    and then this Python API is useless. But I feel it's common that
    people want to do quick data exploration, and the target data
    system is not popular enough to have an existing Spark data
    source implementation. It would be useful if people could quickly
    implement a Spark data source using their favorite language,
    Python.

    I'm +1 to this proposal, assuming that we will keep it simple and
    won't copy all the complicated features we built in DS v2 to this
    new Python API.

    On Tue, Jun 20, 2023 at 2:11 PM Maciej <mszymkiew...@gmail.com>
    wrote:

        Similarly to Jacek, I feel the proposal fails to document an
        actual community need for such a feature.

        Currently, any data source implementation has the potential
        to benefit Spark users across all supported and third-party
        clients. For generally available sources, this is
        advantageous for the whole Spark community and avoids
        creating 1st- and 2nd-tier citizens. This is even more
        important with new officially supported languages being added
        through Spark Connect.

        Instead, we might rather document in detail the process of
        implementing a new source using current APIs and work towards
        easily extensible or customizable sources, in case there is
        such a need.

        --
        Best regards,
        Maciej Szymkiewicz

        Web: https://zero323.net
        PGP: A30CEF0C31A501EC


        On 6/20/23 05:19, Hyukjin Kwon wrote:
        Actually, I support this idea in that Python developers
        wouldn't have to learn Scala (or deal with separate
        packaging) to write their own source. This is especially
        crucial when you want to write a simple data source that
        interacts with the Python ecosystem.

        On Tue, 20 Jun 2023 at 03:08, Denny Lee
        <denny.g....@gmail.com> wrote:

            Slightly biased, but per my conversations - this would
            be awesome to have!

            On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari
            <abdealikoth...@gmail.com> wrote:

                I would definitely use it - if it's available :)

                On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,
                <ja...@japila.pl> wrote:

                    Hi Allison and devs,

                    Although I was against this idea at first sight
                    (probably because I'm a Scala dev), I think it
                    could work as long as there are people who'd be
                    interested in such an API. Were there any? I'm
                    just curious. I've seen no emails requesting it.

                    I also doubt that Python devs would like to work
                    on new data sources, but I support their wishes
                    wholeheartedly :)

                    Pozdrawiam,
                    Jacek Laskowski
                    ----
                    "The Internals Of" Online Books
                    <https://books.japila.pl/>
                    Follow me on https://twitter.com/jaceklaskowski


                    On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
                    <allison.w...@databricks.com.invalid> wrote:

                        Hi everyone,

                        I would like to start a discussion on
                        “Python Data Source API”.

                        This proposal aims to introduce a simple API
                        in Python for Data Sources. The idea is to
                        enable Python developers to create data
                        sources without having to learn Scala or
                        deal with the complexities of the current
                        data source APIs. The goal is to make a
                        Python-based API that is simple and easy to
                        use, thus making Spark more accessible to
                        the wider Python developer community. This
                        proposed approach is based on the recently
                        introduced Python user-defined table
                        functions with extensions to support data
                        sources.
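
                        Purely for illustration, a data source
                        under such an API might look roughly
                        like this (class and method names are
                        hypothetical, not the interface defined
                        in the SPIP doc):

                            from pyspark.sql.types import StructType

                            class ExampleSource:
                                # hypothetical; real base class TBD
                                def schema(self) -> StructType:
                                    return (StructType()
                                            .add("repo", "string")
                                            .add("stars", "int"))

                                def partitions(self):
                                    # one partition per page of a
                                    # made-up endpoint
                                    return list(range(10))

                                def read(self, partition):
                                    # yield rows matching schema()
                                    yield ("org/repo-%d" % partition,
                                           partition * 100)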

                        *SPIP Doc*: https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing


                        *SPIP JIRA*:
                        https://issues.apache.org/jira/browse/SPARK-44076

                        Looking forward to your feedback.

                        Thanks,
                        Allison


