Okay, it seems we can create a separate repo like apache/spark? See, e.g., https://issues.apache.org/jira/browse/INFRA-20470. We can also think about porting the files as they are. I will try to have a short sync with the author, Maciej, and share what we discussed offline.

On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:

>
> On Wednesday, July 22, 2020, Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
>> That's probably one-time overhead so it is not a big issue. In my
>> opinion, a bigger one is possible complexity. Annotations tend to introduce
>> a lot of cyclic dependencies in the Spark codebase. This can be addressed, but
>> doesn't look great.
>>
>>
>> This is not true (anymore). With Python 3.6 you can add string
>> annotations -> 'DenseVector', and in the future with Python 3.7 this is
>> fixed by having postponed evaluation:
>> https://www.python.org/dev/peps/pep-0563/
>>
>
> As far as I recall, the linked PEP addresses backreferences, not cyclic
> dependencies, which weren't a big issue in the first place.
>
> What I mean is actually cyclic stuff: for example, pyspark.context
> depends on pyspark.rdd and the other way around. These dependencies are not
> explicit at the moment.
>
>
>> Merging stubs into the project structure, on the other hand, has almost no
>> overhead.
>>
>>
>> This feels awkward to me, this is like having the docstring in a separate
>> file. In my opinion you want to have the signatures and the functions
>> together for transparency and maintainability.
>>
>
> I guess that's a matter of preference. From a maintainability perspective
> it is actually much easier to have separate objects.
>
> For example, there are different types of objects that are required for
> meaningful checking, which don't really exist in real code (protocols,
> aliases, code-generated signatures for complex overloads) as well as
> some monkey-patched entities.
>
> Additionally, it is often easier to see inconsistencies when typing is
> separate.
>
> However, I am not implying that this should be a persistent state.
>
> In general I see two non-breaking paths here.
>
>    - Merge pyspark-stubs as a separate subproject within the main Spark repo,
>    keep it in sync there with a common CI pipeline, and transfer ownership of
>    the PyPI package to the ASF.
>    - Move stubs directly into python/pyspark and then apply individual stubs
>    to modules of choice.
>
> Of course, the first proposal could be an initial step for the latter one.
>
>
>>
>> I think DBT is a very nice project where they use annotations very well:
>> https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py
>>
>> Also, they left out the types in the docstring, since they are available
>> in the annotations themselves.
>>
>
>> In practice, the biggest advantage is actually support for completion,
>> not type checking (which works in simple cases).
>>
>>
>> Agreed.
>>
>> Would you be interested in writing up the Outreachy proposal for work on
>> this?
>>
>>
>> I would be, and also happy to mentor. But I think we first need to agree
>> as a Spark community whether we want to add the annotations to the code, and
>> to what extent.
>>
>
>> At some point (in general when things are heavy in generics, which is the
>> case here), annotations become somewhat painful to write.
>>
>>
>> That's true, but that might also be a pointer that it is time to refactor
>> the function/code :)
>>
>
> That might be the case, but it is more often a matter of capturing useful
> properties, combined with the requirement to keep things in sync with their
> Scala counterparts.
>
>
>> For now, I tend to think adding type hints to the code makes it difficult
>> to backport or revert, and more difficult to discuss typing on its own,
>> especially considering typing is arguably still premature.
>>
>>
>> This feels a bit weird to me, since you want to keep this in sync, right?
>> Do you provide different stubs for different versions of Python?
>> I had to look up the literals: https://www.python.org/dev/peps/pep-0586/
>>
>
> I think it is more about portability between Spark versions.
>
>>
>> Cheers, Fokko
>>
>
>> On Wed, Jul 22, 2020 at 09:40, Maciej Szymkiewicz <mszymkiew...@gmail.com>
>> wrote:
>>
>>>
>>> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
>>> > For now, I tend to think adding type hints to the code makes it
>>> > difficult to backport or revert, and more difficult to discuss
>>> > typing on its own, especially considering typing is arguably
>>> > still premature.
>>>
>>> About being premature: since the typing ecosystem evolves much faster than
>>> Spark, it might be preferable to keep annotations as a separate project
>>> (preferably under the ASF / Spark umbrella). It allows for faster iterations
>>> and supporting new features (for example, Literals proved to be very
>>> useful), without waiting for the next Spark release.
>>>
>>> --
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> Keybase: https://keybase.io/zero323
>>> Gigs: https://www.codementor.io/@zero323
>>> PGP: A30CEF0C31A501EC
>>>
>
> --
>
> Best regards,
> Maciej Szymkiewicz
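
For anyone following the thread who has not used the two features mentioned above, here is a minimal sketch of string annotations (the Python 3.6 approach Fokko describes) and PEP 586 Literal types combined with overloads (the feature Maciej calls useful). The names Context, Dataset and summarize are hypothetical stand-ins chosen for illustration; they are not PySpark APIs and not taken from pyspark-stubs. Both classes live in one file here, so this only shows a reference to a class defined later, not the cross-module cycle between pyspark.context and pyspark.rdd. For that case, one common approach is to place the import under `if typing.TYPE_CHECKING:` so that it is seen by the type checker but never executed at runtime.

```python
from typing import List, Literal, Union, overload


class Context:
    """Hypothetical stand-in for something like pyspark.context.SparkContext."""

    # "Dataset" is written as a string because the class is defined further
    # down; the name is not looked up when this method is defined.
    def parallelize(self, data: List[int]) -> "Dataset":
        return Dataset(self, data)


class Dataset:
    """Hypothetical stand-in for something like pyspark.rdd.RDD."""

    def __init__(self, ctx: Context, data: List[int]) -> None:
        self.ctx = ctx
        self.data = data

    # Literal types (PEP 586) let a type checker pick the return type
    # based on the value of a string argument.
    @overload
    def summarize(self, how: Literal["count"]) -> int: ...

    @overload
    def summarize(self, how: Literal["mean"]) -> float: ...

    def summarize(self, how: str) -> Union[int, float]:
        # Single runtime implementation behind the two typed signatures.
        if how == "count":
            return len(self.data)
        if how == "mean":
            return sum(self.data) / len(self.data)
        raise ValueError(f"unsupported summary: {how!r}")


ds = Context().parallelize([1, 2, 3])
print(ds.summarize("count"))  # 3
print(ds.summarize("mean"))   # 2.0
```

In a stub-based layout like the one discussed in the thread, the two `@overload` signatures would typically sit in a separate `.pyi` file while the `.py` module keeps only the untyped implementation.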