Hi Maciej, Hyukjin,

Did you find any time to discuss adding the types to the Python repository? Would love to know what came out of it.

Cheers, Fokko

On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko <fo...@driesprong.frl> wrote:

> Mostly echoing stuff that we've discussed in
> https://github.com/apache/spark/pull/29180, but good to have this also on
> the dev-list.
>
> > So IMO maintaining outside in a separate repo is going to be harder.
> > That was why I asked.
>
> I agree with Felix, having this inside of the project would make it much
> easier to maintain. Having it inside of the ASF might make it easier to
> port the pyi files to the actual Spark repository.
>
> > FWIW, NumPy took this approach. They made a separate repo, and merged it
> > into the main repo after it became stable.
>
> As Maciej pointed out:
>
> > As for a POC ‒ we have stubs, which have been maintained over three years
> > now and cover versions between 2.3 (though these are fairly limited) to,
> > with some lag, current master.
>
> What would be required to mark it as stable?
>
> > I guess all depends on how we envision the future of annotations
> > (including, but not limited to, how conservative we want to be in the
> > future). Which is probably something that should be discussed here.
>
> I'm happy to motivate people to contribute type hints, and I believe it is
> a very accessible way to get more people involved in the Python codebase.
> Using the ASF model we can ensure that we require committers/PMC to sign
> off on the annotations.
>
> > Indeed, though the possible advantage is that in theory, you can have
> > a different release cycle than for the main repo (I am not sure if that's
> > feasible in practice or if that was the intention).
>
> Personally, I don't think we need a different cycle if the type hints are
> part of the code itself.
>
> > If my understanding is correct, pyspark-stubs is still incomplete and
> > does not annotate types in some other APIs (by using Any). Correct me if I
> > am wrong, Maciej.
>
> For me, it is a bit like code coverage.
> You want this to be high to make
> sure that you cover most of the APIs, but it will take some time to make it
> complete.
>
> For me, it feels a bit like a chicken and egg problem. Because the type
> hints are in a separate repository, they will always lag behind. Also, it
> is harder to spot where the gaps are.
>
> Cheers, Fokko
>
> On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Oh I think I caused some confusion here.
>> Just for clarification, I wasn’t saying we must port this into a separate
>> repo now. I was saying it can be one of the options we can consider.
>>
>> For a bit more context:
>> This option was considered as, roughly speaking, an invalid option and it
>> might need an incubation process as a separate project.
>> After some investigations, I found that this is still a valid option and
>> we can take this as part of Apache Spark but in a separate repo.
>>
>> FWIW, NumPy took this approach. They made a separate repo
>> <https://github.com/numpy/numpy-stubs>, and merged it into the main repo
>> <https://github.com/numpy/numpy-stubs> after it became stable.
>>
>> My only major concerns are:
>>
>>    - the possibility to fundamentally change the approach in
>>    pyspark-stubs <https://github.com/zero323/pyspark-stubs>. It’s not
>>    because how it was done is wrong but because of how Python type hinting
>>    itself evolves.
>>    - If my understanding is correct, pyspark-stubs
>>    <https://github.com/zero323/pyspark-stubs> is still incomplete and
>>    does not annotate types in some other APIs (by using Any). Correct me
>>    if I am wrong, Maciej.
>>
>> I’ll have a short sync with him to understand this better and share the
>> outcome, since he probably knows the context of PySpark type hints best
>> and I know some of the context in the ASF and Apache Spark.
>>
>> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>
>>> Indeed, though the possible advantage is that in theory, you can have
>>> a different release cycle than for the main repo (I am not sure if that's
>>> feasible in practice or if that was the intention).
>>>
>>> I guess all depends on how we envision the future of annotations
>>> (including, but not limited to, how conservative we want to be in the
>>> future). Which is probably something that should be discussed here.
>>>
>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>
>>> So IMO maintaining outside in a separate repo is going to be harder.
>>> That was why I asked.
>>>
>>> ------------------------------
>>> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com>
>>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>>> *To:* Sean Owen
>>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau;
>>> Spark Dev List
>>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>>
>>> On 8/4/20 9:35 PM, Sean Owen wrote:
>>> > Yes, but the general argument you make here is: if you tie this
>>> > project to the main project, it will _have_ to be maintained by
>>> > everyone. That's good, but also exactly, I think, the downside we want
>>> > to avoid at this stage (I thought?). I understand for some
>>> > undertakings, it's just not feasible to start outside the main
>>> > project, but is there no proof of concept even possible before taking
>>> > this step -- which more or less implies it's going to be owned and
>>> > merged and have to be maintained in the main project.
>>>
>>> I think we have a slightly different understanding here ‒ I believe we
>>> have reached a conclusion that maintaining annotations within the project
>>> is OK, we only differ when it comes to the specific form it should take.
>>>
>>> As for a POC ‒ we have stubs, which have been maintained over three years
>>> now and cover versions between 2.3 (though these are fairly limited) to,
>>> with some lag, current master. There is some evidence they are used in
>>> the wild
>>> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
>>> there are a few contributors
>>> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
>>> least some use cases (https://stackoverflow.com/q/40163106/). So,
>>> subjectively speaking, it seems we're already beyond POC.
>>>
>>> --
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> Keybase: https://keybase.io/zero323
>>> Gigs: https://www.codementor.io/@zero323
>>> PGP: A30CEF0C31A501EC