Yeah, we had a short meeting. I had to check a few other things, so there were some delays. I will share an update soon.
On Thu, Aug 20, 2020 at 7:14 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:

> Hi Maciej, Hyukjin,
>
> Did you find any time to discuss adding the types to the Python
> repository? Would love to know what came out of it.
>
> Cheers, Fokko
>
> On Wed, Aug 5, 2020 at 10:14 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
>
>> Mostly echoing stuff that we've discussed in
>> https://github.com/apache/spark/pull/29180, but good to have this also
>> on the dev-list.
>>
>> > So IMO maintaining outside in a separate repo is going to be harder.
>> > That was why I asked.
>>
>> I agree with Felix: having this inside of the project would make it much
>> easier to maintain. Having it inside of the ASF might also make it easier
>> to port the pyi files to the actual Spark repository.
>>
>> > FWIW, NumPy took this approach. they made a separate repo, and merged
>> > it into the main repo after it became stable.
>>
>> As Maciej pointed out:
>>
>> > As of POC ‒ we have stubs, which have been maintained over three years
>> > now and cover versions between 2.3 (though these are fairly limited) to,
>> > with some lag, current master.
>>
>> What would be required to mark it as stable?
>>
>> > I guess all depends on how we envision the future of annotations
>> > (including, but not limited to, how conservative we want to be in the
>> > future). Which is probably something that should be discussed here.
>>
>> I'm happy to motivate people to contribute type hints, and I believe it
>> is a very accessible way to get more people involved in the Python
>> codebase. Using the ASF model we can ensure that we require committers/PMC
>> to sign off on the annotations.
>>
>> > Indeed, though the possible advantage is that in theory, you can have
>> > a different release cycle than for the main repo (I am not sure if that's
>> > feasible in practice or if that was the intention).
>>
>> Personally, I don't think we need a different cycle if the type hints are
>> part of the code itself.
>>
>> > If my understanding is correct, pyspark-stubs is still incomplete and
>> > does not annotate types in some other APIs (by using Any). Correct me if I
>> > am wrong, Maciej.
>>
>> For me, it is a bit like code coverage. You want this to be high to make
>> sure that you cover most of the APIs, but it will take some time to make it
>> complete.
>>
>> It also feels a bit like a chicken-and-egg problem. Because the type
>> hints are in a separate repository, they will always lag behind. Also, it
>> is harder to spot where the gaps are.
>>
>> Cheers, Fokko
>>
>> On Wed, Aug 5, 2020 at 5:51 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> Oh, I think I caused some confusion here.
>>> Just for clarification, I wasn't saying we must port this into a
>>> separate repo now. I was saying it can be one of the options we can
>>> consider.
>>>
>>> For a bit more context:
>>> This option was previously considered, roughly speaking, an invalid
>>> option that might need an incubation process as a separate project.
>>> After some investigation, I found that it is still a valid option and
>>> we can take this as part of Apache Spark but in a separate repo.
>>>
>>> FWIW, NumPy took this approach: they made a separate repo
>>> <https://github.com/numpy/numpy-stubs>, and merged it into the main repo
>>> <https://github.com/numpy/numpy> after it became stable.
>>>
>>> My only major concerns are:
>>>
>>> - the possibility of fundamentally changing the approach in
>>>   pyspark-stubs <https://github.com/zero323/pyspark-stubs>. It's not
>>>   because how it was done is wrong, but because of how Python type
>>>   hinting itself evolves.
>>> - If my understanding is correct, pyspark-stubs
>>>   <https://github.com/zero323/pyspark-stubs> is still incomplete and
>>>   does not annotate types in some other APIs (by using Any). Correct me
>>>   if I am wrong, Maciej.
>>>
>>> I'll have a short sync with him and share what comes out of it, since
>>> he'd probably know the context best in PySpark type hints, and I know
>>> some of the context in the ASF and Apache Spark.
>>>
>>> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>>
>>>> Indeed, though the possible advantage is that in theory, you can have
>>>> a different release cycle than for the main repo (I am not sure if that's
>>>> feasible in practice or if that was the intention).
>>>>
>>>> I guess it all depends on how we envision the future of annotations
>>>> (including, but not limited to, how conservative we want to be in the
>>>> future). Which is probably something that should be discussed here.
>>>>
>>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>>
>>>> So IMO maintaining outside in a separate repo is going to be harder.
>>>> That was why I asked.
>>>>
>>>> ------------------------------
>>>> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com>
>>>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>>>> *To:* Sean Owen
>>>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau;
>>>> Spark Dev List
>>>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>>>
>>>> On 8/4/20 9:35 PM, Sean Owen wrote:
>>>> > Yes, but the general argument you make here is: if you tie this
>>>> > project to the main project, it will _have_ to be maintained by
>>>> > everyone. That's good, but also exactly I think the downside we want
>>>> > to avoid at this stage (I thought?) I understand for some
>>>> > undertakings, it's just not feasible to start outside the main
>>>> > project, but is there no proof of concept even possible before taking
>>>> > this step -- which more or less implies it's going to be owned and
>>>> > merged and have to be maintained in the main project.
>>>>
>>>> I think we have a bit different understanding here ‒ I believe we have
>>>> reached a conclusion that maintaining annotations within the project is
>>>> OK; we only differ when it comes to the specific form it should take.
>>>>
>>>> As for the POC ‒ we have stubs, which have been maintained for over
>>>> three years now and cover versions from 2.3 (though these are fairly
>>>> limited) to, with some lag, current master. There is some evidence they
>>>> are used in the wild
>>>> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
>>>> there are a few contributors
>>>> (https://github.com/zero323/pyspark-stubs/graphs/contributors), and at
>>>> least some use cases (https://stackoverflow.com/q/40163106/). So,
>>>> subjectively speaking, it seems we're already beyond POC.
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej Szymkiewicz
>>>>
>>>> Web: https://zero323.net
>>>> Keybase: https://keybase.io/zero323
>>>> Gigs: https://www.codementor.io/@zero323
>>>> PGP: A30CEF0C31A501EC
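[Editor's note] For readers following the "incomplete annotations (by using Any)" point in the thread, here is a minimal, hypothetical sketch (these are illustrative functions, not actual PySpark or pyspark-stubs signatures) of the difference between a precise annotation and an `Any` fallback:

```python
from typing import Any, List

# Precisely annotated: a static checker such as mypy can verify that
# callers pass a list of ints and use the int result correctly.
def collect_count(values: List[int]) -> int:
    return len(values)

# `Any` fallback: this signature type-checks against anything, so the
# checker can no longer flag misuse -- the kind of coverage gap the
# thread describes.
def collect_count_untyped(values: Any) -> Any:
    return len(values)

# At runtime both behave identically; the annotations only matter to
# static analysis tools, which is why such gaps are easy to miss.
print(collect_count([1, 2, 3]))
print(collect_count_untyped([1, 2, 3]))
```

This is also why incompleteness resembles code coverage, as noted above: both variants run fine, and only a checker (or an audit of the stubs) reveals where the `Any` fallbacks remain.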