[jira] [Commented] (SPARK-34544) pyspark toPandas() should return pd.DataFrame

Maciej Szymkiewicz (Jira) Mon, 01 Mar 2021 08:22:07 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-34544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292996#comment-17292996
 ]


Maciej Szymkiewicz commented on SPARK-34544:
--------------------------------------------

[~ravwojdyla]

> But until pyspark release, how would we monkey patch that change in our 
> projects?

For in-house deployments, the easiest way is to patch Spark itself so that the 
return type is marked as {{pandas.core.frame.DataFrame}}, and then either patch 
Pandas (https://github.com/pandas-dev/pandas/pull/28831) or put extracted stubs 
on {{MYPYPATH}}.

> So in the end it sounds like we have a bunch of suboptimal ideas, how should 
> we proceed?

It seems like it. If there are popular methods that didn't make it into the 
protocol, I'd probably add them as a temporary fix.
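
To illustrate what such a temporary fix amounts to, here is a minimal, self-contained sketch of a structural protocol that gains one extra method. The names {{FrameLike}}, {{FakeFrame}} and {{summarize}} are hypothetical and used only for illustration; this is not the actual {{DataFrameLike}} definition shipped with PySpark.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class FrameLike(Protocol):
    """Hypothetical stand-in for a DataFrameLike-style protocol."""

    def head(self, n: int = 5) -> Any: ...

    # Method added later as a temporary fix, so that type checkers stop
    # rejecting calls to it on values annotated as FrameLike.
    def describe(self) -> Any: ...


class FakeFrame:
    """Toy class standing in for pandas.DataFrame in this sketch."""

    def __init__(self, rows: list) -> None:
        self.rows = rows

    def head(self, n: int = 5) -> list:
        return self.rows[:n]

    def describe(self) -> dict:
        return {"count": len(self.rows)}


def summarize(frame: FrameLike) -> Any:
    # Type-checks only because describe() is declared on the protocol.
    return frame.describe()
```

Without the {{describe}} declaration on the protocol, a type checker would be expected to flag the call in {{summarize}}, even though the underlying object supports it at runtime.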

Looking forward to Spark 3.2, we can closely monitor Pandas progress: if Pandas 
becomes PEP 561 compliant, we simply drop the protocol. Otherwise we can give 
the Microsoft stubs a shot.

> pyspark toPandas() should return pd.DataFrame
> ---------------------------------------------
>
>                 Key: SPARK-34544
>                 URL: https://issues.apache.org/jira/browse/SPARK-34544
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.1
>            Reporter: Rafal Wojdyla
>            Assignee: Maciej Szymkiewicz
>            Priority: Major
>
> Right now {{toPandas()}} returns {{DataFrameLike}}, which is an incomplete 
> "view" of the pandas {{DataFrame}}. As a result, mypy reports that certain 
> pandas methods are not present in {{DataFrameLike}}, even though those 
> methods are valid on pandas {{DataFrame}}, which is the actual type of the 
> object. Working around this requires type-ignore comments or asserts.
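
As a rough sketch of the kind of workaround the description refers to, the conversion can be wrapped once in a project-local helper so that call sites do not each need an assert or a type-ignore comment. This variant uses {{typing.cast}} rather than an assert; the helper name {{to_pandas_typed}} is an assumption made for illustration and is not part of PySpark.

```python
from typing import TYPE_CHECKING, Any, cast

if TYPE_CHECKING:
    # Imported only for type checking; nothing is imported at runtime.
    import pandas as pd


def to_pandas_typed(spark_df: Any) -> "pd.DataFrame":
    """Call toPandas() and present the result to the type checker as a pandas DataFrame.

    Hypothetical helper: cast() performs no runtime check, it only
    changes the static type seen by tools such as mypy.
    """
    return cast("pd.DataFrame", spark_df.toPandas())
```

Because {{cast}} does not verify anything at runtime, an {{assert isinstance(...)}} inside the helper may be preferable where a runtime check is wanted.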



