Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
Thanks Nicholas for the pointer :-). On Thu, 18 Mar 2021, 00:11 Nicholas Chammas, wrote: > On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon wrote: > >> I am currently thinking we will have to convert the Koalas tests to use >> unittests to match with PySpark for now. >> > Keep in mind that

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Nicholas Chammas
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon wrote: > I am currently thinking we will have to convert the Koalas tests to use > unittests to match with PySpark for now. > Keep in mind that pytest supports unittest-based tests out of the box, so
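
To illustrate the point about pytest and unittest, here is a minimal sketch of a test written in plain `unittest` style. The class name `ExampleFrameTest` and its test data are hypothetical and are not taken from the Koalas or PySpark test suites. Because pytest collects `unittest.TestCase` subclasses, a file like this should run unchanged under either runner.

```python
import unittest


class ExampleFrameTest(unittest.TestCase):
    """Hypothetical unittest-style test case; pytest can collect it as-is."""

    def setUp(self):
        # Stand-in data; a real Koalas/PySpark test would build a DataFrame here.
        self.values = [1, 2, 3]

    def test_sum(self):
        self.assertEqual(sum(self.values), 6)

    def test_length(self):
        self.assertEqual(len(self.values), 3)
```

Saved as, say, `test_example.py` (an assumed file name), this could be run with either `python -m unittest test_example` or `pytest test_example.py`.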

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
Yeah, that's a good point, Georg. I think we will port as is first, and discuss further about that indexing system. We should probably either add non-index mode or switch it to a distributed default index type that minimizes the side effect in query plan. We still have some months left. I will

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Georg Heiler
Would you plan to keep the existing indexing mechanism then? https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index For me, even when trying to use the distributed version, it always resulted in various window functions being
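
As background for the index discussion, the following is a rough pure-Python sketch of why a "distributed" style index can be assigned without cross-partition coordination, while a strictly sequential one cannot. It assumes the distributed index is built on Spark's `monotonically_increasing_id`, whose documented layout puts the partition ID in the upper 31 bits and the per-partition record number in the lower 33 bits. The function names below are hypothetical and this is not the Koalas implementation.

```python
from typing import List, Sequence


def sequence_index(partitions: Sequence[Sequence[object]]) -> List[int]:
    """Consecutive 0..n-1 ids: each partition needs the sizes of all earlier ones."""
    ids: List[int] = []
    start = 0
    for rows in partitions:
        ids.extend(range(start, start + len(rows)))
        start += len(rows)  # global state carried across partitions
    return ids


def distributed_index(partitions: Sequence[Sequence[object]]) -> List[int]:
    """Ids computed per partition: partition id in the high bits, row offset in the low 33 bits."""
    ids: List[int] = []
    for partition_id, rows in enumerate(partitions):
        for offset in range(len(rows)):
            ids.append((partition_id << 33) + offset)
    return ids


parts = [["a", "b", "c"], ["d", "e", "f"]]
print(sequence_index(parts))     # [0, 1, 2, 3, 4, 5]
print(distributed_index(parts))  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The distributed ids are increasing and unique but not consecutive, which is the trade-off for not needing a global pass over the data.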

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
> Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard? It's roughly 75% done so far (in Series, DataFrame and Index). Yeah,

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Bryan Cutler
+1 the proposal sounds good to me. Having a familiar API built-in will really help new users get into using Spark that might only have Pandas experience. It sounds like maintenance costs should be manageable, once the hurdle with setting up tests is done. Just out of curiosity, does Koalas pretty

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Andrew Melo
Hi, Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to arrow buffers then transmits them over the socket to

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Hyukjin Kwon
Thank you guys for all your feedback. I will start working on the SPIP with the Koalas team. I expect the SPIP can be sent late this week or early next week. I have inlined answers to the unanswered questions below: Is the community developing the pandas API layer for Spark interested in being

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Wenchen Fan
+1, it's great to have Pandas support in Spark out of the box. On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro wrote: > +1; the pandas interfaces are pretty popular and supporting them in > pyspark looks promising, I think. > one question I have; what's an initial goal of the proposal? > Is

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Takeshi Yamamuro
+1; the pandas interfaces are pretty popular and supporting them in pyspark looks promising, I think. one question I have; what's an initial goal of the proposal? Is that to port all the pandas interfaces that Koalas has already implemented? Or, the basic set of them? On Tue, Mar 16, 2021 at 1:44

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Ismaël Mejía
+1 Bringing a Pandas API for pyspark to upstream Spark will only bring benefits for everyone (more eyes to use/see/fix/improve the API) as well as better alignment with core Spark improvements, the extra weight looks manageable. On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas wrote: > > On

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin wrote: > I don't think we should deprecate existing APIs. > +1 I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way. For the large

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Maciej
I concur. These two don't have the same target audience or expressiveness. I cannot imagine most of the PySpark projects I've seen switching to a Pandas-style API. If this is to be included, it would be great if we could model it similarly to SQLAlchemy, with its core and ORM components being equally

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Reynold Xin
I don't think we should deprecate existing APIs. Spark's own Python API is relatively stable and not difficult to support. It has a pretty large number of users and existing code. It is also pretty easy for data engineers to learn. The pandas API is great for data science, but isn't that great for some

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Dongjoon Hyun
Thank you for the proposal. It looks like a good addition. BTW, what is the future plan for the existing APIs? Are we going to deprecate them eventually in favor of Koalas (because we don't remove the existing APIs in general)? > Fourthly, PySpark is still not Pythonic enough. For example, I hear

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Hyukjin Kwon
Firstly my biggest reason is that I would like to promote this more as a built-in support because it is simply important to have it with the impact on the large user group, and the needs are increasing as the charts indicate. I usually think that features or add-ons stay as third parties when it’s

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Sean Owen
I like Koalas a lot. Playing devil's advocate, why not just let it continue to live as an add-on? Usually the argument is that it'll be maintained better in Spark, but it's already well maintained. Conversely, it adds some overhead to maintaining Spark. On the upside, it makes it a little more discoverable. Are

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-13 Thread Liang-Chi Hsieh
From a Python developer's perspective, this direction makes sense to me. As pandas is almost the standard library in the related area, if PySpark supports the pandas API out of the box, usability would be at a higher level. For maintenance cost, IIUC, there are some Spark committers in the

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-13 Thread Holden Karau
I think having pandas support inside of Spark makes sense. One of my questions is: who are the major contributors to this effort? Is the community developing the pandas API layer for Spark interested in being part of Spark, or do they prefer having their own release cycle? On Sat, Mar 13, 2021 at

[DISCUSS] Support pandas API layer on PySpark

2021-03-13 Thread Hyukjin Kwon
Hi all, I would like to start the discussion on supporting pandas API layer on Spark. If we have a general consensus on having it in PySpark, I will initiate and drive an SPIP with a detailed explanation about the implementation’s overview and structure. I would appreciate it if I can know