Re: [PySpark] Revisiting PySpark type annotations

Felix Cheung Tue, 04 Aug 2020 09:45:30 -0700

What would be the reason for a separate git repo?

________________________________
From: Hyukjin Kwon <gurwls...@gmail.com>
Sent: Monday, August 3, 2020 1:58:55 AM
To: Maciej Szymkiewicz <mszymkiew...@gmail.com>
Cc: Driesprong, Fokko <fo...@driesprong.frl>; Holden Karau 
<hol...@pigscanfly.ca>; Spark Dev List <dev@spark.apache.org>
Subject: Re: [PySpark] Revisiting PySpark type annotations

Okay, it seems we can create a separate repo alongside apache/spark? See e.g.
https://issues.apache.org/jira/browse/INFRA-20470
We can also think about porting the files as they are.
I will try to have a short sync with the author, Maciej, and share what we
discussed offline.

On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz
<mszymkiew...@gmail.com> wrote:

On Wednesday, 22 July 2020, Driesprong, Fokko <fo...@driesprong.frl> wrote:
That's probably a one-time overhead, so it is not a big issue. In my opinion, a
bigger one is the possible complexity. Annotations tend to introduce a lot of
cyclic dependencies in the Spark codebase. This can be addressed, but it doesn't
look great.

This is not true (anymore). With Python 3.6 you can add string annotations ->
'DenseVector', and in the future with Python 3.7 this is fixed by having
postponed evaluation: https://www.python.org/dev/peps/pep-0563/
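
The string-annotation approach mentioned above can be sketched roughly as follows. This is a minimal illustration only: the two classes are simplified stand-ins, not the real PySpark vector classes. On Python 3.7+, PEP 563 makes the quotes unnecessary once `from __future__ import annotations` is placed at the top of the module.

```python
from typing import get_type_hints


class SparseVector:
    """Simplified stand-in, not the real PySpark class."""

    # 'DenseVector' is defined further down in the module, so the annotation
    # is written as a string and is only resolved when something asks for it.
    def to_dense(self) -> "DenseVector":
        return DenseVector()


class DenseVector:
    """Simplified stand-in, not the real PySpark class."""

    def to_sparse(self) -> "SparseVector":
        return SparseVector()


# Once both classes exist, the string annotations can be resolved to classes.
hints = get_type_hints(SparseVector.to_dense)
```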

As far as I recall, the linked PEP addresses back references, not cyclic
dependencies, which weren't a big issue in the first place.

What I mean is actual cyclic stuff, for example pyspark.context depends on
pyspark.rdd and the other way around. These dependencies are not explicit at the
moment.
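
One commonly used way to make such a dependency explicit for the type checker without importing the other module at runtime is a `typing.TYPE_CHECKING` guard. The sketch below is an assumption about how this could look, not something proposed in this thread; `rdd_module` and the `Context` class are hypothetical names.

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Executed only by static type checkers. At runtime TYPE_CHECKING is
    # False, so this import never runs and cannot create an import cycle.
    # `rdd_module` is a hypothetical module name used for illustration.
    from rdd_module import RDD


class Context:
    """Simplified stand-in for a context class (not the real SparkContext)."""

    def describe(self, rdd: "RDD") -> List[str]:
        # The annotation stays a plain string at runtime, so `RDD` does not
        # need to be importable here.
        return [type(self).__name__, type(rdd).__name__]
```

The other module can guard its import of `Context` in the same way, so the cycle exists only in the eyes of the type checker.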

Merging stubs into the project structure, on the other hand, has almost no overhead.

This feels awkward to me, this is like having the docstring in a separate file.
In my opinion you want to have the signatures and the functions together for
transparency and maintainability.

I guess that's a matter of preference. From a maintainability perspective, it is
actually much easier to have separate objects.

For example, there are different types of objects that are required for
meaningful checking, which don't really exist in real code (protocols, aliases,
code-generated signatures for complex overloads), as well as some
monkey-patched entities.
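
To give a rough idea of the typing-only objects listed above, here is a minimal sketch with a type alias, a protocol, and overloaded signatures. All names are illustrative and the `Column` class is a simplified stand-in; nothing here is taken from the actual stubs. It assumes Python 3.8+ for `typing.Protocol`.

```python
from typing import List, Protocol, Union, overload


class Column:
    """Simplified stand-in for a column expression (not the real PySpark class)."""

    def __init__(self, name: str) -> None:
        self.name = name


# Type alias: exists only to keep signatures readable.
ColumnOrName = Union[Column, str]


class HasName(Protocol):
    """Protocol: any object with a `name` attribute matches structurally."""

    name: str


@overload
def to_names(cols: ColumnOrName) -> str: ...
@overload
def to_names(cols: List[ColumnOrName]) -> List[str]: ...


def to_names(cols):
    """Single runtime implementation behind the two typed signatures."""
    if isinstance(cols, list):
        return [c.name if isinstance(c, Column) else c for c in cols]
    return cols.name if isinstance(cols, Column) else cols
```

Only the final `to_names` definition does any work at runtime; the alias, the protocol, and the `@overload` declarations are there for the type checker, which is why they sit naturally in a separate stub file.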

Additionally it is often easier to see inconsistencies when typing is separate.

However, I am not implying that this should be a persistent state.

In general, I see two non-breaking paths here.

- Merge pyspark-stubs as a separate subproject within the main Spark repo, keep
it in sync there with a common CI pipeline, and transfer ownership of the PyPI
package to the ASF.
- Move stubs directly into python/pyspark and then apply individual stubs to
modules of choice.

Of course, the first proposal could be an initial step for the latter one.

I think DBT is a very nice project where they use annotations very well:
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left out the types in the docstring, since they are available in the
annotations themselves.

In practice, the biggest advantage is actually support for completion, not type
checking (which works in simple cases).

Agreed.

Would you be interested in writing up the Outreachy proposal for work on this?

I would be, and also happy to mentor. But I think we first need to agree as a
Spark community on whether we want to add the annotations to the code, and to
what extent.

At some point (in general when things are heavy in generics, which is the case
here), annotations become somewhat painful to write.

That's true, but that might also be a pointer that it is time to refactor the
function/code :)

That might be the case, but it is more often a matter of capturing useful
properties combined with the requirement to keep things in sync with Scala
counterparts.

For now, I tend to think adding type hints to the code makes it difficult to
backport or revert, and more difficult to discuss typing on its own, especially
considering typing is arguably still premature.

This feels a bit weird to me, since you want to keep this in sync right? Do you
provide different stubs for different versions of Python? I had to look up the
literals: https://www.python.org/dev/peps/pep-0586/

I think it is more about portability between Spark versions

Cheers, Fokko

On Wed, Jul 22, 2020 at 09:40, Maciej Szymkiewicz
<mszymkiew...@gmail.com> wrote:

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the code makes it
> difficult to backport or revert, and
> more difficult to discuss typing on its own, especially considering
> typing is arguably still premature.

About being premature: since the typing ecosystem evolves much faster than
Spark, it might be preferable to keep annotations as a separate project
(preferably under the ASF / Spark umbrella). It allows for faster iterations
and support for new features (for example, Literals proved to be very
useful), without waiting for the next Spark release.
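
For readers unfamiliar with Literal types (PEP 586), a minimal sketch of why they are useful for string-valued parameters follows. The alias and function are hypothetical and not taken from the stubs. `typing.Literal` is in the standard library from Python 3.8; earlier versions need the `typing_extensions` backport.

```python
from typing import Literal, get_args

# Hypothetical alias: the accepted string values become part of the type.
JoinHow = Literal["inner", "left", "right"]


def describe_join(how: JoinHow = "inner") -> str:
    """A type checker can flag describe_join("iner") before the code runs."""
    if how not in get_args(JoinHow):
        raise ValueError(f"unsupported join type: {how}")
    return f"{how} join"
```

The runtime check is optional; the main benefit is that editors can complete the allowed values and type checkers can reject typos statically.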

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC

