It's a good question. Py4J seems to have been updated five times in 2016,
and each update is a bit involved (from a review point of view, verifying
the zip file contents is somewhat tedious).

cloudpickle is harder to judge, since changes to our copy of cloudpickle
aren't always correctly tagged as backports from the fork (and those can
take a while to review, since we don't always catch them right away as
being backports).

Another difficulty with counting backports is that, since our review
process for PySpark has historically been on the slow side, changes
benefiting systems like Dask or IPython Parallel were not backported to
Spark unless the issues they fixed caused serious errors.

I think the key benefits are: better test coverage of the forked version of
cloudpickle, more standardized packaging of dependencies, and simpler
dependency updates, which would reduce the friction to gaining the benefits
of related projects' work - Python serialization really isn't our secret
sauce.
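
For concreteness, a minimal sketch of what the pinned dependencies might
look like in setup.py (the exact version numbers here are illustrative
placeholders, not a concrete proposal):

    # setup.py - sketch only; the version pins are illustrative placeholders
    from setuptools import setup

    setup(
        name="pyspark",
        # ... other packaging metadata ...
        install_requires=[
            "cloudpickle==0.2.2",  # exact pin on the maintained fork
            "py4j==0.10.4",        # exact pin, since we rely on Py4J internals
        ],
    )

Exact pins (==) rather than version ranges would keep behaviour
reproducible, which matters given how much we depend on internal APIs.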

If I'm missing any substantial benefits or costs I'd love to know :)

On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <r...@databricks.com> wrote:

> With any dependency update (or refactoring of existing code), I always ask
> this question: what's the benefit? In this case it looks like the benefit
> is to reduce the effort spent on backports. Do you know how often we've
> needed to do those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Hi PySpark Developers,
>>
>> Cloudpickle is a core part of PySpark, originally copied from (and
>> improved upon) PiCloud. Since then, other projects have found cloudpickle
>> useful, and a fork of cloudpickle
>> <https://github.com/cloudpipe/cloudpickle> was created and is now
>> maintained as its own library <https://pypi.python.org/pypi/cloudpickle>
>> (with better test coverage and resulting bug fixes, I understand). We've
>> had a few PRs backporting fixes from the cloudpickle project into Spark's
>> local copy of cloudpickle - how would people feel about moving to an
>> explicit (pinned) dependency on cloudpickle?
>>
>> We could add cloudpickle to the setup.py and a requirements.txt file for
>> users who prefer not to do a system installation of PySpark.
>>
>> Py4J is maybe an even simpler case: we currently have a zip of py4j in
>> our repo, but could instead require a pinned version. While we do depend
>> on a lot of Py4J internal APIs, version pinning should be sufficient to
>> ensure functionality (and would simplify the update process).
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
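
As a sketch of the requirements.txt side of this (again with placeholder
pins, matching the hypothetical setup.py above), something like:

    # requirements.txt - illustrative pins only
    cloudpickle==0.2.2
    py4j==0.10.4

which users who prefer not to do a system installation of PySpark could
consume with "pip install -r requirements.txt" inside a virtualenv.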


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
