With such limited scope (both language availability and features), do we have any representative examples of sources that would significantly benefit from this API, compared to other available options such as batch imports, direct queries from vectorized UDFs, or even interfacing with sources through third-party FDWs?

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 6/20/23 16:23, Wenchen Fan wrote:
In an ideal world, every data source you want to connect to already has a Spark data source implementation (either v1 or v2), then this Python API is useless. But I feel it's common that people want to do quick data exploration, and the target data system is not popular enough to have an existing Spark data source implementation. It will be useful if people can quickly implement a Spark data source using their favorite Python language.

I'm +1 to this proposal, assuming that we will keep it simple and won't copy all the complicated features we built in DS v2 to this new Python API.

On Tue, Jun 20, 2023 at 2:11 PM Maciej <mszymkiew...@gmail.com> wrote:

    Similarly to Jacek, I feel the proposal fails to document an actual
    community need for such a feature.

    Currently, any data source implementation has the potential to
    benefit Spark users across all supported and third-party clients.
    For generally available sources, this is advantageous for the
    whole Spark community and avoids creating 1st- and 2nd-tier
    citizens. This is even more important with new officially
    supported languages being added through Spark Connect.

    Instead, we could document in detail the process of
    implementing a new source with the current APIs, and work towards
    easily extensible or customizable sources if there is such a
    need.

    --
    Best regards,
    Maciej Szymkiewicz

    Web: https://zero323.net
    PGP: A30CEF0C31A501EC


    On 6/20/23 05:19, Hyukjin Kwon wrote:
    Actually, I support this idea in that it lets Python developers
    write their own sources without having to learn Scala (or deal
    with separate packaging).
    This is more crucial especially when you want to write a simple
    data source that interacts with the Python ecosystem.

    On Tue, 20 Jun 2023 at 03:08, Denny Lee <denny.g....@gmail.com>
    wrote:

        Slightly biased, but per my conversations - this would be
        awesome to have!

        On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari
        <abdealikoth...@gmail.com> wrote:

            I would definitely use it - if it's available :)

            On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,
            <ja...@japila.pl> wrote:

                Hi Allison and devs,

                Although I was against this idea at first sight
                (probably because I'm a Scala dev), I think it
                could work as long as there are people who'd be
                interested in such an API. Were there any? I'm just
                curious. I've seen no emails requesting it.

                I also doubt that Python devs would like to work on
                new data sources but support their wishes
                wholeheartedly :)

                Pozdrawiam,
                Jacek Laskowski
                ----
                "The Internals Of" Online Books
                <https://books.japila.pl/>
                Follow me on https://twitter.com/jaceklaskowski



                On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
                <allison.w...@databricks.com.invalid> wrote:

                    Hi everyone,

                    I would like to start a discussion on “Python
                    Data Source API”.

                    This proposal aims to introduce a simple API in
                    Python for Data Sources. The idea is to enable
                    Python developers to create data sources without
                    having to learn Scala or deal with the
                    complexities of the current data source APIs. The
                    goal is to make a Python-based API that is simple
                    and easy to use, thus making Spark more
                    accessible to the wider Python developer
                    community. This proposed approach is based on the
                    recently introduced Python user-defined table
                    functions with extensions to support data sources.
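
                    For illustration only, here is a rough sketch of
                    the kind of interface such an API might expose.
                    All names below (DataSource, RangeSource, the
                    schema/read methods) are hypothetical placeholders
                    I made up for this sketch, not the API defined in
                    the SPIP doc:

```python
# Hypothetical sketch only: class and method names are illustrative
# placeholders, not the actual API proposed in the SPIP.
from abc import ABC, abstractmethod
from typing import Iterator, Tuple


class DataSource(ABC):
    """A hypothetical pure-Python data source: declares a schema and yields rows."""

    @abstractmethod
    def schema(self) -> str:
        """Return a DDL-style schema string, e.g. 'id INT, name STRING'."""

    @abstractmethod
    def read(self) -> Iterator[Tuple]:
        """Yield rows as plain Python tuples matching the schema."""


class RangeSource(DataSource):
    """Toy example: emits (id, name) pairs for a small fixed range."""

    def __init__(self, n: int) -> None:
        self.n = n

    def schema(self) -> str:
        return "id INT, name STRING"

    def read(self) -> Iterator[Tuple]:
        for i in range(self.n):
            yield (i, f"row-{i}")


# The engine would consume the iterator and map rows onto the schema.
rows = list(RangeSource(3).read())
print(rows)  # [(0, 'row-0'), (1, 'row-1'), (2, 'row-2')]
```

                    The appeal is that a source like this needs no
                    Scala, no JVM build, and no separate packaging;
                    how rows actually flow into Spark is exactly what
                    the SPIP would define.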

                    *SPIP Doc*:
                    
https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing


                    *SPIP JIRA*:
                    https://issues.apache.org/jira/browse/SPARK-44076

                    Looking forward to your feedback.

                    Thanks,
                    Allison


