Re: [PySpark] Revisiting PySpark type annotations

Hyukjin Kwon Mon, 03 Aug 2020 02:00:03 -0700

Okay, it seems we can create a separate repo, like apache/spark? For example:
https://issues.apache.org/jira/browse/INFRA-20470
We can also think about porting the files as they are.
I will try to have a short sync with the author, Maciej, and share what we
discussed offline.



On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:

>
>
> On Wednesday, 22 July 2020, Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
>> That's probably a one-time overhead, so it is not a big issue. In my
>> opinion, a bigger one is possible complexity. Annotations tend to introduce
>> a lot of cyclic dependencies in the Spark codebase. This can be addressed, but
>> it doesn't look great.
>>
>>
>> This is not true (anymore). With Python 3.6 you can add string
>> annotations -> 'DenseVector', and in the future with Python 3.7 this is
>> fixed by having postponed evaluation:
>> https://www.python.org/dev/peps/pep-0563/
>>
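
A minimal sketch of the string-annotation approach mentioned above. The two classes are simplified stand-ins, not the real pyspark.ml.linalg implementations; the point is only that a quoted annotation can name a class that is defined later in the module.

```python
class SparseVector:
    # DenseVector is defined further down, so the annotation is written as a
    # string; it is stored as-is and not evaluated when the method is defined.
    def to_dense(self) -> "DenseVector":
        return DenseVector()


class DenseVector:
    # Hypothetical counterpart method, included only to show both directions.
    def to_sparse(self) -> "SparseVector":
        return SparseVector()


# The annotation is kept as a plain string until a tool resolves it.
raw_annotation = SparseVector.to_dense.__annotations__["return"]
```

On Python 3.7+, putting `from __future__ import annotations` (PEP 563) at the top of the module has a similar effect without the quotes.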
>
> As far as I recall, the linked PEP addresses forward references, not cyclic
> dependencies, and forward references weren't a big issue in the first place.
>
> What I mean is actual cyclic stuff - for example, pyspark.context
> depends on pyspark.rdd and the other way around. These dependencies are not
> explicit at the moment.
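
One common way to annotate across such a cycle is a `typing.TYPE_CHECKING` guard plus a deferred runtime import. The sketch below is an illustration only: `demo_context` and `demo_rdd` are hypothetical modules written to a temporary directory, standing in for two modules that need each other's classes, and this is not how PySpark itself is organised.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module 1: its method returns a class from module 2.
CONTEXT_SRC = textwrap.dedent('''
    from typing import TYPE_CHECKING, List

    if TYPE_CHECKING:  # False at runtime, so this import never executes
        from demo_rdd import RDD

    class Context:
        def parallelize(self, data: List[int]) -> "RDD":
            from demo_rdd import RDD  # deferred until the call
            return RDD(self, data)
''')

# Hypothetical module 2: its constructor is annotated with a class from module 1.
RDD_SRC = textwrap.dedent('''
    from typing import TYPE_CHECKING, List

    if TYPE_CHECKING:
        from demo_context import Context

    class RDD:
        def __init__(self, ctx: "Context", data: List[int]) -> None:
            self.ctx = ctx
            self.data = data
''')

# Write both modules to a temporary directory and make it importable.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "demo_context.py").write_text(CONTEXT_SRC)
Path(tmp_dir, "demo_rdd.py").write_text(RDD_SRC)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

demo_context = importlib.import_module("demo_context")
rdd = demo_context.Context().parallelize([1, 2, 3])
```

The cycle is still there for the type checker, which is the complexity being described: the annotations make a dependency explicit that the runtime code only has implicitly.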
>
>
>
>> Merging stubs into the project structure, on the other hand, has almost no
>> overhead.
>>
>>
>> This feels awkward to me; it is like having the docstring in a separate
>> file. In my opinion, you want to have the signatures and the functions
>> together for transparency and maintainability.
>>
>>
> I guess that's a matter of preference. From a maintainability perspective,
> it is actually much easier to have separate objects.
>
> For example, there are different types of objects that are required for
> meaningful checking but don't really exist in real code (protocols,
> aliases, code-generated signatures for complex overloads), as well as
> some monkey-patched entities.
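
To make the three kinds of typing-only objects concrete, here is a small sketch. Every name in it (`Column`, `ColumnOrName`, `HasName`, `col_names`, `describe`) is hypothetical and chosen for illustration; none is taken from pyspark-stubs.

```python
from typing import List, Protocol, Union, overload


class Column:
    """Hypothetical stand-in for a column expression."""

    def __init__(self, name: str) -> None:
        self.name = name


# Alias: exists only to keep signatures readable.
ColumnOrName = Union[Column, str]


class HasName(Protocol):
    """Protocol: a structural type that is never instantiated."""

    name: str


# Overloads: describe the return type per argument type for the checker.
@overload
def col_names(cols: ColumnOrName) -> str: ...
@overload
def col_names(cols: List[ColumnOrName]) -> List[str]: ...


def col_names(cols):
    """Untyped implementation that the overloads above describe."""

    def one(c: ColumnOrName) -> str:
        return c if isinstance(c, str) else c.name

    return [one(c) for c in cols] if isinstance(cols, list) else one(cols)


def describe(obj: HasName) -> str:
    # Accepts any object with a `name` attribute, Column included.
    return f"<{obj.name}>"
```

In a stub file only the alias, the protocol and the `@overload` signatures would appear, which is why they have no counterpart in the runtime code.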
>
> Additionally it is often easier to see inconsistencies when typing is
> separate.
>
> However, I am not implying that this should be a persistent state.
>
> In general, I see two non-breaking paths here.
>
>  - Merge pyspark-stubs as a separate subproject within the main Spark repo,
> keep it in sync there with a common CI pipeline, and transfer ownership of
> the PyPI package to the ASF.
>  - Move stubs directly into python/pyspark and then apply individual stubs
> to modules of choice.
>
> Of course, the first proposal could be an initial step for the latter one.
>
>
>>
>> I think DBT is a very nice project where they use annotations very well:
>> https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py
>>
>> Also, they left out the types in the docstring, since they are available
>> in the annotations themselves.
>>
>>
>
>> In practice, the biggest advantage is actually support for completion,
>> not type checking (which works in simple cases).
>>
>>
>> Agreed.
>>
>> Would you be interested in writing up the Outreachy proposal for work on
>> this?
>>
>>
>> I would be, and I am also happy to mentor. But I think we first need to
>> agree as a Spark community on whether we want to add the annotations to the
>> code, and to what extent.
>>
>
>
>
>
>
>> At some point (in general when things are heavy in generics, which is the
>> case here), annotations become somewhat painful to write.
>>
>>
>> That's true, but that might also be a pointer that it is time to refactor
>> the function/code :)
>>
>
> That might be the case, but it is more often a matter of capturing useful
> properties, combined with the requirement to keep things in sync with the
> Scala counterparts.
>
>
>
>> For now, I tend to think adding type hints to the code makes it difficult
>> to backport or revert, and more difficult to discuss typing on its own,
>> especially considering typing is arguably still premature.
>>
>>
>> This feels a bit weird to me, since you want to keep this in sync, right?
>> Do you provide different stubs for different versions of Python? I had to
>> look up the literals: https://www.python.org/dev/peps/pep-0586/
>>
>
> I think it is more about portability between Spark versions.
>
>>
>>
>> Cheers, Fokko
>>
>
>> On Wed, 22 Jul 2020 at 09:40, Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>>
>>> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
>>> > For now, I tend to think adding type hints to the code makes it
>>> > difficult to backport or revert, and
>>> > more difficult to discuss typing on its own, especially considering
>>> > typing is arguably still premature.
>>>
>>> About being premature ‒ since the typing ecosystem evolves much faster
>>> than Spark, it might be preferable to keep annotations as a separate project
>>> (preferably under the ASF / Spark umbrella). It allows for faster iterations
>>> and supporting new features (for example, Literals proved to be very
>>> useful), without waiting for the next Spark release.
>>>
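
For readers unfamiliar with the Literal types (PEP 586) mentioned above, a minimal sketch follows. The `JoinType` alias, its four values and the `describe_join` function are assumptions made for illustration and do not reflect the actual pyspark-stubs definitions.

```python
from typing import Literal, get_args

# Illustrative subset of allowed string values; a type checker rejects
# any other string literal passed where JoinType is expected.
JoinType = Literal["inner", "left", "right", "outer"]


def describe_join(how: JoinType = "inner") -> str:
    # Literal is not enforced at runtime, so validate explicitly as well.
    if how not in get_args(JoinType):
        raise ValueError(f"unsupported join type: {how!r}")
    return f"JOIN ({how})"
```

A call such as `describe_join("lef")` would be flagged by a checker before the code runs, which plain `str` annotations cannot offer.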
>>> --
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> Keybase: https://keybase.io/zero323
>>> Gigs: https://www.codementor.io/@zero323
>>> PGP: A30CEF0C31A501EC
>>>
>>>
>>>
>
> --
>
> Best regards,
> Maciej Szymkiewicz
>
>
>
