Thank you guys for all your feedback. I will start working on the SPIP with
the Koalas team.
I expect the SPIP can be sent out late this week or early next week.


I have inlined and answered the remaining open questions below:

Is the community developing the pandas API layer for Spark interested in
being part of Spark or do they prefer having their own release cycle?

Yeah, the Koalas team used to have its own release cycle so that it could
develop and move quickly.
Now that it has become pretty mature, having reached 1.7.0, the team thinks
it is fine to have less frequent releases, and they are happy to work
together with Spark by contributing to it. The active contributors in the
Koalas community will continue to make their contributions in Spark.

How about test code? Does it fit into the PySpark test framework?

Yes, this is one of the places that will need some effort. Koalas currently
uses pytest with various dependency version combinations (e.g., Python
version, conda vs. pip), whereas PySpark uses plain unittest with fewer
dependency version combinations.

For pytest in Koalas <> unittest in PySpark:

  I am currently thinking we will have to convert the Koalas tests to use
unittest to match PySpark for now (see the sketch after this list).
  It is also a feasible option for PySpark to migrate to pytest, but that
would need extra effort to make it work seamlessly with our own PySpark
testing framework.
  The Koalas team (presumably and likely I) will take a look in any event.
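
A hypothetical sketch of the kind of conversion meant above: a pytest-style
Koalas test rewritten in the unittest style that PySpark's testing framework
uses. The test and class names here are illustrative, not the actual suite.

import unittest

import pandas as pd
import databricks.koalas as ks


# pytest style (roughly how Koalas tests are written today):
#
# def test_head():
#     pdf = pd.DataFrame({"a": [1, 2, 3]})
#     kdf = ks.from_pandas(pdf)
#     assert kdf.head(2).to_pandas().equals(pdf.head(2))

# unittest style, matching PySpark's existing framework:
class HeadTest(unittest.TestCase):
    def test_head(self):
        pdf = pd.DataFrame({"a": [1, 2, 3]})
        kdf = ks.from_pandas(pdf)
        self.assertTrue(kdf.head(2).to_pandas().equals(pdf.head(2)))


if __name__ == "__main__":
    unittest.main()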

For the combinations of dependency versions:

  Due to the lack of resources in GitHub Actions, I currently plan to just
add the Koalas tests into the matrix PySpark is currently using.

one question I have; what’s an initial goal of the proposal?
Is that to port all the pandas interfaces that Koalas has already
implemented?
Or, the basic set of them?

The goal of the proposal is to port the whole Koalas project into PySpark.
For example,

import koalas

will be equivalent to

# Names, etc. might change in the final proposal or during the review
from pyspark.sql import pandas

Koalas supports the pandas APIs through a separate layer that covers the
differences between the DataFrame structures in pandas and PySpark, e.g.,
column names (labels) of types other than string, an index (something like
the row number in DBMSs), and so on. So I think it would make more sense
to port the whole layer instead of a subset of the APIs.
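
As a rough illustration of that layer (API names as of Koalas 1.x; the
eventual module and function names in PySpark may differ), Koalas adds
pandas-style index and label semantics on top of a Spark DataFrame:

import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# pandas-style row index, which a plain PySpark DataFrame does not have
print(kdf.index)

# elementwise operations behave like pandas
print(kdf["a"] + kdf["b"])

# the same data is still backed by a Spark DataFrame underneath
kdf.to_spark().show()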





2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cloud0...@gmail.com>님이 작성:

> +1, it's great to have Pandas support in Spark out of the box.
>
> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> +1; the pandas interfaces are pretty popular and supporting them in
>> pyspark looks promising, I think.
>> one question I have; what's an initial goal of the proposal?
>> Is that to port all the pandas interfaces that Koalas has already
>> implemented?
>> Or, the basic set of them?
>>
>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>> well as better alignment with core Spark improvements, the extra
>>> weight looks manageable.
>>>
>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>> >
>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <r...@databricks.com>
>>> wrote:
>>> >>
>>> >> I don't think we should deprecate existing APIs.
>>> >
>>> >
>>> > +1
>>> >
>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
>>> could be wrong, but I wager most people who have worked with both Spark and
>>> Pandas feel the same way.
>>> >
>>> > For the large community of current PySpark users, or users switching
>>> to PySpark from another Spark language API, it doesn't make sense to
>>> deprecate the current API, even by convention.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
