Thanks for sharing, Hyukjin. However, I'm not sure we're taking the right
direction.

Today I've updated [SPARK-17333][PYSPARK] Enable mypy on the repository
<https://github.com/apache/spark/pull/29180/> and while doing so I noticed
that all the methods that aren't in the pyi file *cannot be called from
other type-checked Python files*. I was unaware of this effect of the pyi
files: as soon as you create them, all the omitted methods are shielded
from external access. Feels like going back to C++ :'(
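A minimal sketch of the effect (the module layout and function names below are made up for illustration, not the real PySpark ones): once a .pyi stub exists, mypy reads only the stub, so anything the stub omits is invisible to type-checked callers, even though it still works at runtime:

```python
# Sketch of the stub-shadowing effect; "serializers" and both function
# names are hypothetical, not actual PySpark modules or APIs.
#
# Suppose serializers.py defines two functions, but serializers.pyi
# declares only one:
#
#   # serializers.pyi
#   def dump_stream(data: bytes) -> bytes: ...
#   # load_stream is missing from the stub
#
# serializers.py -- the actual implementation:

def dump_stream(data: bytes) -> bytes:   # present in the stub
    return data

def load_stream(data: bytes) -> bytes:   # absent from the stub
    return data

# At runtime both functions import and run fine:
print(load_stream(b"ok"))

# But mypy checks callers against the stub alone, so another module doing
# `from serializers import load_stream` is flagged as referring to an
# attribute that does not exist: for type analysis, the .pyi file
# completely replaces the .py file.
```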

With the current stubs as-is, I already had to add a few public methods to
the serializers.pyi
<https://github.com/zero323/pyspark-stubs/pull/464/files#diff-92c0b8f614297ea5a12f0491ccb4b316>,
not to mention the private methods in the utils.pyi
<https://github.com/zero323/pyspark-stubs/pull/464/files#diff-9dc1ff7de58fe85eead4416952e78b2e>.
This made me nervous: it is easy to forget methods that are only used
through the external API. If we forget or miss a function, that
functionality becomes unusable for the end-user. Of course, this can be
caught by adding tests that cover the full public API, but I think this
approach is very error-prone.

Also, it is very cumbersome to add everything to the pyi file. In practice,
this means copying the method definition from the py file and pasting it
into the pyi file. This hurts my developer's heart, as it violates the DRY
principle. Just adding the signatures to the py files makes so much more
sense to me: we don't have to copy all the signatures to a separate file,
and we can't forget to make methods public. Not to mention, the stub files
would potentially break most of the open PRs against the PySpark codebase.
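To illustrate the duplication (with a made-up helper, not an actual PySpark function): the stub approach maintains every signature in two files, while the inline approach states it exactly once, next to the implementation it describes:

```python
# Stub approach -- the signature lives in two files that must be kept
# in sync by hand (hypothetical example):
#
#   # utils.py
#   def to_str(value):
#       return str(value)
#
#   # utils.pyi -- the same signature, copied manually
#   def to_str(value: object) -> str: ...
#
# Inline approach -- the signature is written once, in the py file itself:

def to_str(value: object) -> str:
    """Convert a value to its string form."""
    return str(value)
```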

I see many big projects using regular annotations:
- Pandas:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L51
- Koalas:
https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L469
- Apache Airflow:
https://github.com/apache/airflow/blob/master/airflow/executors/celery_executor.py#L278
- Data Build Tool:
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/deps/registry.py#L74

Now, knowing the effect of the pyi files, I'm still strongly in favor of
moving the type definitions inline:

   - This will make the adoption of the type definitions much smoother,
   since we can do it in smaller iterations instead of per file (yes, I'm
   looking at you, feature.py
   <https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py>,
   5+ kloc)
   - We don't have to worry that we forgot a function and potentially screw
   up a release
   - We can add type hints to both public and private methods
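For example (with hypothetical functions), inline hints can be rolled out one function at a time within a single file; mypy simply treats functions that are still unannotated as dynamically typed and skips them:

```python
from typing import Optional

# Already migrated: fully annotated, so mypy checks this function
# and every call site that uses it.
def parse_port(raw: str) -> Optional[int]:
    return int(raw) if raw.isdigit() else None

# Not yet migrated: no annotations, so mypy leaves it unchecked and it
# can be annotated later in a small, focused PR.
def legacy_helper(value):
    return value
```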

Can we add SPARK-17333
<https://issues.apache.org/jira/browse/SPARK-17333> as a subtask of
SPARK-32681 <https://issues.apache.org/jira/browse/SPARK-32681>?

Cheers, Fokko




On Thu, Aug 27, 2020 at 13:43, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Okay, it took me a while because I had to check the options and
> feasibility we discussed here.
>
> TL;DR: I think we can just port the pyi files directly into the PySpark
> main repository as they are.
>
> I would like to share only the key points here because it looks like I,
> Maciej and people here agree with this direction.
>
> - The stability of the PySpark stubs seems good enough to port them
> directly into the main repository.
>     At least they cover most of the user-facing APIs, so there won't be
> much advantage in running it separately (vs. the overhead of creating and
> maintaining a separate repo).
> - There's a possibility that the type hinting way can be changed
> drastically but it will be manageable given that it will be handled within
> the same pyi files.
> - We'll need some tests for that.
> - We'll make sure there's no external user app breakage by this.
>
> There will likely be some other meta work such as adding tests and/or
> documentation. So I filed an umbrella JIRA for that: SPARK-32681
> <https://issues.apache.org/jira/browse/SPARK-32681>.
> If there are no objections to this direction, I think we can hopefully
> start. Let me know if you have thoughts on this.
>
> Thanks!
>
>
>
> On Thu, Aug 20, 2020 at 8:39 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>
>> No worries, thanks for the update!
>>
>> On Thu, Aug 20, 2020 at 12:50, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> Yeah, we had a short meeting. I had to check a few other things so some
>>> delays happened. I will share soon.
>>>
>>> On Thu, Aug 20, 2020 at 7:14 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>>
>>>> Hi Maciej, Hyukjin,
>>>>
>>>> Did you find any time to discuss adding the types to the Python
>>>> repository? Would love to know what came out of it.
>>>>
>>>> Cheers, Fokko
>>>>
>>>> On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko
>>>> <fo...@driesprong.frl> wrote:
>>>>
>>>>> Mostly echoing stuff that we've discussed in
>>>>> https://github.com/apache/spark/pull/29180, but good to have this
>>>>> also on the dev-list.
>>>>>
>>>>> > So IMO maintaining outside in a separate repo is going to be harder.
>>>>> That was why I asked.
>>>>>
>>>>> I agree with Felix: having this inside of the project would make it
>>>>> much easier to maintain. Having it inside the ASF might also make it
>>>>> easier to port the pyi files to the actual Spark repository.
>>>>>
>>>>> > FWIW, NumPy took this approach. They made a separate repo, and
>>>>> merged it into the main repo after it became stable.
>>>>>
>>>>> As Maciej pointed out:
>>>>>
>>>>> > As of POC ‒ we have stubs, which have been maintained over three
>>>>> years now and cover versions between 2.3 (though these are fairly limited)
>>>>> to, with some lag, current master.
>>>>>
>>>>> What would be required to mark it as stable?
>>>>>
>>>>> > I guess all depends on how we envision the future of annotations
>>>>> (including, but not limited to, how conservative we want to be in the
>>>>> future). Which is probably something that should be discussed here.
>>>>>
>>>>> I'm happy to motivate people to contribute type hints, and I believe
>>>>> it is a very accessible way to get more people involved in the Python
>>>>> codebase. Using the ASF model we can ensure that we require committers/PMC
>>>>> to sign off on the annotations.
>>>>>
>>>>> > Indeed, though the possible advantage is that in theory, you can
>>>>> have different release cycle than for the main repo (I am not sure if
>>>>> that's feasible in practice or if that was the intention).
>>>>>
>>>>> Personally, I don't think we need a different cycle if the type
>>>>> hints are part of the code itself.
>>>>>
>>>>> > If my understanding is correct, pyspark-stubs is still incomplete
>>>>> and does not annotate types in some other APIs (by using Any). Correct me
>>>>> if I am wrong, Maciej.
>>>>>
>>>>> For me, it is a bit like code coverage. You want this to be high to
>>>>> make sure that you cover most of the APIs, but it will take some time to
>>>>> make it complete.
>>>>>
>>>>> For me, it feels a bit like a chicken-and-egg problem. Because the
>>>>> type hints are in a separate repository, they will always lag behind.
>>>>> Also, it is harder to spot where the gaps are.
>>>>>
>>>>> Cheers, Fokko
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>
>>>>>> Oh I think I caused some confusion here.
>>>>>> Just for clarification, I wasn’t saying we must port this into a
>>>>>> separate repo now. I was saying it can be one of the options we can
>>>>>> consider.
>>>>>>
>>>>>>
>>>>>> For a bit more context:
>>>>>> Roughly speaking, this option had been considered invalid, possibly
>>>>>> needing an incubation process as a separate project.
>>>>>> After some investigation, I found that it is still a valid option
>>>>>> and we can take it as part of Apache Spark, but in a separate repo.
>>>>>>
>>>>>>
>>>>>> FWIW, NumPy took this approach. They made a separate repo
>>>>>> <https://github.com/numpy/numpy-stubs>, and merged it into the main
>>>>>> repo <https://github.com/numpy/numpy> after it became stable.
>>>>>>
>>>>>> My only major concerns are:
>>>>>>
>>>>>>    - The possibility to fundamentally change the approach in
>>>>>>    pyspark-stubs <https://github.com/zero323/pyspark-stubs>. It’s
>>>>>>    not because how it was done is wrong, but because of how Python
>>>>>>    type hinting itself evolves.
>>>>>>    - If my understanding is correct, pyspark-stubs
>>>>>>    <https://github.com/zero323/pyspark-stubs> is still incomplete
>>>>>>    and does not annotate types in some other APIs (by using Any).
>>>>>>    Correct me if I am wrong, Maciej.
>>>>>>
>>>>>>
>>>>>> I’ll have a short sync with him and share the outcome, since he
>>>>>> probably knows the context of the PySpark type hints best, and I know
>>>>>> some of the context in the ASF and Apache Spark.
>>>>>>
>>>>>> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz
>>>>>> <mszymkiew...@gmail.com> wrote:
>>>>>>
>>>>>>> Indeed, though the possible advantage is that in theory, you can
>>>>>>> have different release cycle than for the main repo (I am not sure
>>>>>>> if that's feasible in practice or if that was the intention).
>>>>>>>
>>>>>>> I guess all depends on how we envision the future of annotations
>>>>>>> (including, but not limited to, how conservative we want to be in
>>>>>>> the future). Which is probably something that should be discussed
>>>>>>> here.
>>>>>>>
>>>>>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>>>>>
>>>>>>> So IMO maintaining outside in a separate repo is going
>>>>>>> to be harder. That was why I asked.
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com>
>>>>>>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>>>>>>> *To:* Sean Owen
>>>>>>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko;
>>>>>>> Holden Karau; Spark Dev List
>>>>>>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>>>>>>
>>>>>>> On 8/4/20 9:35 PM, Sean Owen wrote:
>>>>>>>
>>>>>>> > Yes, but the general argument you make here is: if you tie this
>>>>>>> > project to the main project, it will _have_ to be maintained by
>>>>>>> > everyone. That's good, but also exactly I think the downside we want
>>>>>>> > to avoid at this stage (I thought?) I understand for some
>>>>>>> > undertakings, it's just not feasible to start outside the main
>>>>>>> > project, but is there no proof of concept even possible before taking
>>>>>>> > this step -- which more or less implies it's going to be owned and
>>>>>>> > merged and have to be maintained in the main project.
>>>>>>>
>>>>>>> I think we have a bit different understanding here ‒ I believe we have
>>>>>>> reached a conclusion that maintaining annotations within the project is
>>>>>>> OK, we only differ when it comes to the specific form it should take.
>>>>>>>
>>>>>>> As of POC ‒ we have stubs, which have been maintained over three years
>>>>>>> now and cover versions between 2.3 (though these are fairly limited) to,
>>>>>>> with some lag, current master. There is some evidence they are used in
>>>>>>> the wild
>>>>>>> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
>>>>>>> there are a few contributors
>>>>>>> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
>>>>>>> least some use cases (https://stackoverflow.com/q/40163106/). So,
>>>>>>> subjectively speaking, it seems we're already beyond POC.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Maciej Szymkiewicz
>>>>>>>
>>>>>>> Web: https://zero323.net
>>>>>>> Keybase: https://keybase.io/zero323
>>>>>>> Gigs: https://www.codementor.io/@zero323
>>>>>>> PGP: A30CEF0C31A501EC
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
