So it is pretty core, but its one of the better indirectly tested components. I think probably the most reasonable path is to see what the diff ends up looking like and make a call at that point for if we want it to go to master or master & branch-2.3?
On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon <[email protected]> wrote: > > So given that it fixes some real world bugs, any particular reason why? > Would you be comfortable with doing it in 2.3.1? > > Ah, I don't feel strongly about this but RC2 will be running on and > cloudpickle's quite core fix to PySpark. Just thought we might want to have > enough time with it. > > One worry is, upgrading it includes a fix about namedtuple too where > PySpark has a custom fix. > I would like to check few things about this. > > So, yea, it's vague. I wouldn't stay against if you'd prefer. > > > > > 2018-01-19 16:42 GMT+09:00 Holden Karau <[email protected]>: > >> >> >> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon" <[email protected]> wrote: >> >> > Is it an option to match the latest version of cloudpickle and still >> set protocol level 2? >> >> IMHO, I think this can be an option but I am not fully sure yet if we >> should/could go ahead for it within Spark 2.X. I need some >> investigations including things about Pyrolite. >> >> Let's go ahead with matching it to 0.4.2 first. I am quite clear on >> matching it to 0.4.2 at least. >> >> So given that there is a follow up on which fixes a regression if we're >> not comfortable doing the latest version let's double-check that the >> version we do upgrade to doesn't have that regression. >> >> >> >> > I agree that upgrading to try and match version 0.4.2 would be a good >> starting point. Unless no one objects, I will open up a JIRA and try to do >> this. >> >> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear. >> >> So given that it fixes some real world bugs, any particular reason why? >> Would you be comfortable with doing it in 2.3.1? >> >> >> >> > Also lets try to keep track in our commit messages which version of >> cloudpickle we end up upgrading to. >> >> +1: PR description, commit message or any unit to identify each will be >> useful. >> It should be easier once we have a matched version. >> >> >> >> 2018-01-19 12:55 GMT+09:00 Holden Karau <[email protected]>: >> >>> So if there are different version of Python on the cluster machines I >>> think that's already unsupported so I'm not worried about that. >>> >>> I'd suggest going to the highest released version since there appear to >>> be some useful fixes between 0.4.2 & 0.5.2 >>> >>> Also lets try to keep track in our commit messages which version of >>> cloudpickle we end up upgrading to. >>> >>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <[email protected]> wrote: >>> >>>> Thanks for all the details and background Hyukjin! Regarding the pickle >>>> protocol change, if I understand correctly, it is currently at level 2 in >>>> Spark which is good for backwards compatibility for all of Python 2. >>>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and >>>> above, will pick a level determined by your Python version. So is the >>>> concern here for Spark if someone has different versions of Python in their >>>> cluster, like 3.5 and 3.3, then different protocols will be used and >>>> deserialization might fail? Is it an option to match the latest version of >>>> cloudpickle and still set protocol level 2? >>>> >>>> I agree that upgrading to try and match version 0.4.2 would be a good >>>> starting point. Unless no one objects, I will open up a JIRA and try to do >>>> this. >>>> >>>> Thanks, >>>> Bryan >>>> >>>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <[email protected]> >>>> wrote: >>>> >>>>> Hi Bryan, >>>>> >>>>> Yup, I support to match the version. I pushed it forward before to >>>>> match it with https://github.com/cloudpipe/cloudpickle >>>>> before few times in Spark's copy and also cloudpickle itself with few >>>>> fixes. I believe our copy is closest to 0.4.1. >>>>> >>>>> I have been trying to follow up the changes in cloudpipe/cloudpickle >>>>> for which version we should match, I think we should match >>>>> it with 0.4.2 first (I need to double check) because IMHO they have >>>>> been adding rather radical changes from 0.5.0, including >>>>> pickle protocol change (by default). >>>>> >>>>> Personally, I would like to match it with the latest because there >>>>> have been some important changes. For >>>>> example, see this too - https://github.com/cloudpipe >>>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually but >>>>> 0.4.2 should be >>>>> a good start point. >>>>> >>>>> For the strategy, I think we can match it and follow 0.4.x within >>>>> Spark for the conservative and safe choice + minimal cost. >>>>> >>>>> >>>>> I tried to leave few explicit answers to the questions from you, Bryan: >>>>> >>>>> > Spark is currently using a forked version and it seems like updates >>>>> are made every now and then when >>>>> > needed, but it's not really clear where the current state is and how >>>>> much it has diverged. >>>>> >>>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC. >>>>> >>>>> >>>>> > Are there any known issues with recent changes from those that >>>>> follow cloudpickle dev? >>>>> >>>>> I am technically involved in cloudpickle dev although less active. >>>>> They changed default pickle protocol (https://github.com/cloudpipe/ >>>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check >>>>> the potential compatibility issue, or fix the protocol, which I >>>>> believe is introduced from 0.5.x. >>>>> >>>>> >>>>> >>>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <[email protected]>: >>>>> >>>>>> Hi All, >>>>>> >>>>>> I've seen a couple issues lately related to cloudpickle, notably >>>>>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to >>>>>> get some feedback on updating the version in PySpark which should fix >>>>>> these >>>>>> issues and allow us to remove some workarounds. Spark is currently >>>>>> using a >>>>>> forked version and it seems like updates are made every now and then when >>>>>> needed, but it's not really clear where the current state is and how much >>>>>> it has diverged. This makes back-porting fixes difficult. There was a >>>>>> previous discussion on moving it to a dependency here >>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>, >>>>>> but given the status right now I think it would be best to do another >>>>>> update and bring things closer to upstream before we talk about >>>>>> completely >>>>>> moving it outside of Spark. Before starting another update, it might be >>>>>> good to discuss the strategy a little. Should the version in Spark be >>>>>> derived from a release or at least tied to a specific commit? It would >>>>>> also be good if we can document where it has diverged. Are there any >>>>>> known >>>>>> issues with recent changes from those that follow cloudpickle dev? Any >>>>>> other thoughts or concerns? >>>>>> >>>>>> Thanks, >>>>>> Bryan >>>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> Twitter: https://twitter.com/holdenkarau >>> >> >> >> > -- Twitter: https://twitter.com/holdenkarau
