Hi folks,

Attempting to dump a trained GP estimator (7000 samples, 16 features) using joblib (compress=6) causes the following error:
Traceback (most recent call last):
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 241, in save
    obj, filename = self._write_array(obj, filename)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 214, in _write_array
    compress=self.compress)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 89, in write_zfile
    file_handle.write(zlib.compress(asbytes(data), compress))
OverflowError: size does not fit in an int

Traceback (most recent call last):
  File "learn.py", line 28, in <module>
    joblib.dump(reg, opt.save_model, compress = 6)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 367, in dump
    pickler.dump(value)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 486, in save_string
    self.write(BINSTRING + pack("<i", n) + obj)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

The same error occurs when dumping a random 40000 x 40000 array (i.e. > 4GB) to disk:

import numpy as np
from sklearn.externals import joblib

w = np.random.random((40000, 40000))
joblib.dump(w, "test.pkl", compress=6)

This machine has 64GB of RAM, so saving/loading this shouldn't be a problem; in fact, we'd like to go all the way to 15000 samples if possible. Unsurprisingly, disabling compression makes the error go away, but it also generates huge files.

Python is 2.7.5, numpy is 1.8.0 and sklearn is mblondel's kernel_ridge branch [1] (i.e. joblib 0.7.1), but the same happens with a fresh pull from the mainline (joblib 0.8.0a3).
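The interim workaround I've been toying with is a rough, untested sketch: dump with compress=0 and then gzip each output file in streaming chunks, so no single >2GB buffer is ever handed to zlib.compress. It assumes the uncompressed dump of the estimator `reg` writes a main pickle plus *.npy sidecar files next to it, and "model.pkl" is just a made-up filename:

import glob
import gzip
import shutil

from sklearn.externals import joblib

# Dump without joblib compression; with large arrays this writes the main
# pickle plus separate .npy files alongside it (big, but it works).
joblib.dump(reg, "model.pkl", compress=0)  # "model.pkl" is a placeholder name

# Gzip each output file afterwards, streaming in chunks so zlib never
# receives a single buffer larger than 2GB.
for fname in glob.glob("model.pkl*"):
    with open(fname, "rb") as src:
        with gzip.open(fname + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)

Loading would then require gunzipping the files back before calling joblib.load, which is clunky, so I'd still prefer a proper fix or a recommended way to persist estimators this large.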
Perhaps the joblib list would be a more appropriate place for this, but since the object being pickled comes from sklearn I thought it best to run it by you first. Does anyone have experience with model persistence for such big estimators? How should they be stored on disk? This may be an explicit limitation in Python [2], but since a dump at the same compression level for 5000 samples already takes ~1.4GB, I'm a bit concerned about how large an uncompressed version would grow for the sizes we're interested in.

Thanks!

[1] -- https://github.com/mblondel/scikit-learn/tree/kernel_ridge
[2] -- http://bugs.python.org/issue8651