[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216345#comment-17216345 ]
Apache Spark commented on SPARK-17333:
--------------------------------------
User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/30088
> Make pyspark interface friendly with mypy static analysis
> ---------------------------------------------------------
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Assaf Mendelson
> Priority: Trivial
>
> Static analysis tools, such as those commonly used by IDEs for auto-completion
> and error marking, tend to give poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically such as the max
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning
> that we chain many actions (e.g. df.filter().groupby().agg()....), and since
> Python has no type information, this can become difficult to understand.
> I would suggest changing the interface to improve it.
> The way I see it we can either change the interface or provide interface
> enhancements.
> Changing the interface means defining (when possible) all functions directly,
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py
> and then generating the functions programmatically by using _create_function,
> create the function directly.
> def max(col):
>     """
>     docstring
>     """
>     _create_function(max,"docstring")
> Second, we can add type indications to all functions, as defined in PEP 484 or
> PyCharm's legacy type hinting
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>     """
>     does a max.
>     :type col: Column
>     :rtype: Column
>     """
> This would provide a wide range of support, as these types of hints, while old,
> are pretty common.
> A second option is to use PEP 3107 function annotations to define interfaces
> in stub (pyi) files. In this case we might have a functions.pyi file which
> would contain something like:
> def max(col: Column) -> Column:
>     """
>     Aggregate function: returns the maximum value of the expression in a
>     group.
>     """
>     ...
> This has the advantage of easier-to-understand types and of not touching the
> code (only supported code), but has the disadvantage of being separately
> managed (i.e. a greater chance of making a mistake) and of requiring some
> configuration in the IDE/static analysis tool instead of working out of the
> box.
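
For illustration, the difference between the generated and the directly defined style described in the issue can be sketched without Spark. This is only a rough sketch, not the pyspark implementation: `Column` here is a hypothetical stand-in class, `generated_max` and `direct_max` are made-up names, and `_create_function` is a simplified imitation of the helper mentioned in the description.

```python
from typing import get_type_hints


class Column:
    """Hypothetical stand-in for pyspark.sql.Column so the sketch runs without Spark."""

    def __init__(self, expr: str) -> None:
        self.expr = expr


def _create_function(name: str, doc: str = ""):
    """Simplified imitation of a helper that builds a function at runtime."""
    def _(col):
        return Column(f"{name}({col.expr})")
    _.__name__ = name
    _.__doc__ = doc
    return _


# Style 1: functions generated from a dictionary at import time.
# A static tool reading this file never sees a `def generated_max(...)`.
_functions = {"generated_max": "Aggregate function: returns the maximum value."}
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)


# Style 2: function defined directly, with PEP 3107 annotations.
def direct_max(col: Column) -> Column:
    """Aggregate function: returns the maximum value of the expression in a group."""
    return Column(f"max({col.expr})")


# The generated function carries no type information; the direct one does.
generated_hints = get_type_hints(globals()["generated_max"])
direct_hints = get_type_hints(direct_max)
generated_expr = globals()["generated_max"](Column("x")).expr
direct_expr = direct_max(Column("x")).expr
```

Both styles behave the same at runtime, but only the second exposes a signature and types that a checker such as mypy or an IDE can read from the source.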