Hi,

Bringing Koalas into pyspark might help enable a richer integration
between the two. One thing that would be useful with a tighter
integration is support for custom column array types. Currently, Spark
takes dataframes, converts them to Arrow buffers, then transmits them
over a socket to Python. On the other side, pyspark takes the Arrow
buffer and converts it to a Pandas dataframe. Unfortunately, the
default Pandas representation of a list-type column turns what were
contiguous value/offset arrays in Arrow into a deserialized Python
object per row. Obviously, this kills performance.
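
To make the cost concrete, here is a small standalone sketch (plain
pyarrow + pandas, not pyspark code) showing that a jagged list column
is two contiguous buffers in Arrow but becomes one Python object per
row after the default pandas conversion:

import pyarrow as pa

# A jagged list column: in Arrow this is just two contiguous buffers.
arr = pa.array([[1, 2, 3], [4], [5, 6]])
print(arr.offsets)  # int32 offsets: [0, 3, 4, 6]
print(arr.values)   # int64 values:  [1, 2, 3, 4, 5, 6]

# The default pandas conversion materializes one Python object per row.
df = pa.table({"values": arr}).to_pandas()
print(df["values"].dtype)          # object
print(type(df["values"].iloc[0]))  # a per-row object, not a flat buffer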

A PR to extend the pyspark API to elide the Pandas conversion
(https://github.com/apache/spark/pull/26783) was submitted but
rejected, which is unfortunate. Perhaps this proposed integration
would provide the hooks, via Pandas' ExtensionArray interface, for
Spark to exchange jagged/ragged lists with Python UDFs efficiently.
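
As a concrete illustration of the kind of hook I mean (again a
standalone sketch, not Spark code): pyarrow already lets a pandas
ExtensionDtype take over a column's conversion via
to_pandas(types_mapper=...) and the __from_arrow__ protocol. The
nullable Int64 dtype below ships with pandas; a hypothetical list
dtype could use the same hook to keep Arrow's offset/value buffers
intact instead of exploding them into per-row objects:

import pandas as pd
import pyarrow as pa

table = pa.table({"x": pa.array([1, 2, None])})

# Default conversion: the null forces float64 with NaN.
print(table.to_pandas()["x"].dtype)  # float64

# types_mapper lets a pandas ExtensionDtype (via __from_arrow__) build
# the column itself; nullable Int64 keeps it a masked integer array.
mapped = table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
print(mapped["x"].dtype)  # Int64

# A (hypothetical) list ExtensionDtype with __from_arrow__ could do the
# same for list columns, preserving the contiguous offset/value buffers.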

Cheers
Andrew

On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Thank you guys for all your feedback. I will start working on the SPIP with
> the Koalas team.
> I expect the SPIP can be sent out late this week or early next week.
>
>
> I inlined and answered the questions unanswered as below:
>
> Is the community developing the pandas API layer for Spark interested in 
> being part of Spark or do they prefer having their own release cycle?
>
> Yeah, the Koalas team used to have its own release cycle so it could develop
> and move quickly.
> It has now become pretty mature, reaching 1.7.0, and the team thinks it's
> fine to have less frequent releases and is happy to work together with Spark
> by contributing to it. The active contributors in the Koalas community will
> continue to make their contributions in Spark.
>
> How about test code? Does it fit into the PySpark test framework?
>
> Yes, this will be one of the places that needs some effort. Koalas currently
> uses pytest with various dependency version combinations (e.g., Python
> version, conda vs pip), whereas PySpark uses plain unittest with fewer
> dependency version combinations.
>
> For pytest in Koalas <> unittest in PySpark:
>
>   I am currently thinking we will have to convert the Koalas tests to use
>   unittest to match PySpark for now.
>   Migrating PySpark to pytest is also a feasible option, but it would need
>   extra effort to make it work seamlessly with our own PySpark testing
>   framework.
>   The Koalas team (presumably and likely I) will take a look in any event.
>
> For the combinations of dependency versions:
>
>   Due to the lack of resources in GitHub Actions, I currently plan to just
>   add the Koalas tests into the matrix PySpark is currently using.
>
> one question I have; what’s an initial goal of the proposal?
> Is that to port all the pandas interfaces that Koalas has already implemented?
> Or, the basic set of them?
>
> The goal of the proposal is to port all of the Koalas project into PySpark.
> For example,
>
> import koalas
>
> will be equivalent to
>
> # Names, etc. might change in the final proposal or during the review
> from pyspark.sql import pandas
>
> Koalas supports the pandas APIs with a separate layer that covers the
> differences between the DataFrame structures in pandas and PySpark, e.g.,
> other types as column names (labels), an index (something like a row number
> in DBMSs), and so on. So I think it would make more sense to port the whole
> layer instead of a subset of the APIs.
>
>
> 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cloud0...@gmail.com>님이 작성:
>>
>> +1, it's great to have Pandas support in Spark out of the box.
>>
>> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <linguin....@gmail.com> 
>> wrote:
>>>
>>> +1; the pandas interfaces are pretty popular and supporting them in pyspark 
>>> looks promising, I think.
>>> one question I have; what's an initial goal of the proposal?
>>> Is that to port all the pandas interfaces that Koalas has already 
>>> implemented?
>>> Or, the basic set of them?
>>>
>>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>
>>>> +1
>>>>
>>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>>> well as better alignment with core Spark improvements; the extra
>>>> weight looks manageable.
>>>>
>>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>> >
>>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <r...@databricks.com> wrote:
>>>> >>
>>>> >> I don't think we should deprecate existing APIs.
>>>> >
>>>> >
>>>> > +1
>>>> >
>>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I 
>>>> > could be wrong, but I wager most people who have worked with both Spark 
>>>> > and Pandas feel the same way.
>>>> >
>>>> > For the large community of current PySpark users, or users switching to 
>>>> > PySpark from another Spark language API, it doesn't make sense to 
>>>> > deprecate the current API, even by convention.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
