I don't have any strong views, so just to highlight possible issues:
* Based on the different issues I've seen, a substantial number of
users depend on system-wide Python installations. As far as I am
aware, neither Py4j nor cloudpickle is available in the standard
system repositories of Debian or Red Hat derivatives.
* Assuming that Spark is committed to supporting Python 2 beyond its
end of life, we have to be sure that any external dependency follows
the same policy.
* Py4j is missing from the default Anaconda channel. Not a big issue,
just a small annoyance.
* External dependencies with pinned versions add some overhead to
development across Spark versions (effectively we may need a separate
environment for each major Spark release). I've seen small
inconsistencies in PySpark behavior with different Py4j versions, so
this is not completely hypothetical.
* Adding possible version conflicts. It is probably not a big risk,
but something to consider (for example when combining Blaze, Dask,
and PySpark; see the sketch after this list).
* Adding another party the user has to trust.
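
To make the conflict point a bit more concrete, here is a small,
purely illustrative check (the requirement strings below are made up,
not actual PySpark pins) showing how a hard pin could surface as a
resolution error in a shared environment:

    # Purely illustrative: check whether two hypothetical pins on
    # cloudpickle can be satisfied together in the current environment.
    # Neither requirement string reflects a real PySpark pin.
    import pkg_resources

    requirements = [
        "cloudpickle==0.2.2",  # hypothetical pin coming from PySpark
        "cloudpickle>=0.3",    # hypothetical pin from another library
    ]

    try:
        pkg_resources.require(*requirements)
    except pkg_resources.ResolutionError as exc:
        print("Dependency resolution failed:", exc)
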
On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in
> 2016, and updating it is a bit involved (from a review point of
> view, verifying the zip file contents is somewhat tedious).
>
> cloudpickle is a bit more difficult to tell, since we can have
> changes to cloudpickle which aren't correctly tagged as backports
> from the fork (and these can take a while to review, since we
> don't always catch them right away as being backports).
>
> Another difficulty with looking at backports is that since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
>
> I think the key benefits are better test coverage of the forked
> version of cloudpickle, more standardized packaging of
> dependencies, and simpler dependency updates, which reduce the
> friction in picking up improvements from related projects' work -
> Python serialization really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs I'd love to know :)
>
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <[email protected]> wrote:
>
> With any dependency update (or refactoring of existing code), I
> always ask this question: what's the benefit? In this case it
> looks like the benefit is to reduce the effort spent on backports.
> Do you know how often we've needed to do those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <[email protected]> wrote:
>
> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, and was originally copied
> from (and improved upon) picloud. Since then other projects have
> found cloudpickle useful, and a fork of cloudpickle
> <https://github.com/cloudpipe/cloudpickle> was created and is now
> maintained as its own library
> <https://pypi.python.org/pypi/cloudpickle> (with better test
> coverage and resulting bug fixes, I understand). We've had a few
> PRs backporting fixes from the cloudpickle project into Spark's
> local copy of cloudpickle - how would people feel about moving to
> an explicit (pinned) dependency on cloudpickle?
>
> We could add cloudpickle to the setup.py and a
> requirements.txt file for users who prefer not to do a system
> installation of PySpark.
>
> Py4J is maybe an even simpler case: we currently have a zip of
> py4j in our repo, but could instead require a pinned version.
> While we do depend on a lot of py4j internal APIs, version pinning
> should be sufficient to ensure functionality (and to simplify the
> update process).
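>
> Roughly, the pinned dependencies could look something like this in
> setup.py (the version numbers below are placeholders for
> illustration, not a proposal for specific pins):
>
>     # Illustrative sketch only: version numbers are placeholders.
>     from setuptools import setup
>
>     setup(
>         name="pyspark",
>         # ... existing arguments ...
>         install_requires=[
>             "py4j==0.10.4",        # placeholder pin
>             "cloudpickle==0.2.2",  # placeholder pin
>         ],
>     )
>
> A requirements.txt would presumably carry the same pins for users
> installing outside of setup.py.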
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
--
Maciej Szymkiewicz