Hi folks,

Attempting to dump a trained GP estimator (7000 samples, 16 features)
with joblib (compress = 6) raises the following error:


Traceback (most recent call last):
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 241, in save
    obj, filename = self._write_array(obj, filename)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 214, in _write_array
    compress=self.compress)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 89, in write_zfile
    file_handle.write(zlib.compress(asbytes(data), compress))
OverflowError: size does not fit in an int

Traceback (most recent call last):
  File "learn.py", line 28, in <module>
    joblib.dump(reg, opt.save_model, compress = 6)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 367, in dump
    pickler.dump(value)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
    return Pickler.save(self, obj)
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/rgunter/.local/lib/python2.7/pickle.py", line 486, in save_string
    self.write(BINSTRING + pack("<i", n) + obj)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647


The same error occurs when dumping a random 40000 x 40000 float64 array
to disk (~12.8 GB, i.e. well past what fits in a signed 32-bit int):

  import numpy as np
  from sklearn.externals import joblib

  w = np.random.random((40000, 40000))
  joblib.dump(w, "test.pkl", compress = 6)

This machine has 64 GB of RAM, so saving/loading this shouldn't be a
problem; in fact, we'd like to go all the way up to 15000 samples if
possible. Unsurprisingly, disabling compression makes the error go
away, but it also generates huge files.
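
For reference, the uncompressed route I'm falling back to looks roughly
like this (same `w` as the snippet above; if I'm reading numpy_pickle
correctly, joblib then writes the big arrays out as separate .npy files
next to the pickle, so nothing has to go through zlib.compress in one
piece):

  import numpy as np
  from sklearn.externals import joblib

  w = np.random.random((40000, 40000))

  # No compress argument: no zlib, no overflow, but ~13 GB on disk
  # spread over the pickle plus its companion *.npy files.
  created_files = joblib.dump(w, "test_uncompressed.pkl")
  w2 = joblib.load("test_uncompressed.pkl")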

Python is 2.7.5, numpy is 1.8.0, and sklearn is mblondel's kernel_ridge
branch[1] (i.e. joblib 0.7.1), but the same error occurs with a fresh
pull of mainline (joblib 0.8.0a3).

Perhaps the joblib list would be a more appropriate place for this, but
since the object being pickled comes from sklearn I thought it best to
run it by you first. Does anyone have experience with model persistence
for estimators this big? How should they be stored on disk? This looks
like a known limitation in Python itself[2], but since the compressed
dump for 5000 samples already weighs in at ~1.4 GB, I'm a bit concerned
about how large an uncompressed version would grow for the sizes we're
interested in.
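
In case it helps pin it down, the struct.error at the bottom of the
traceback can be reproduced on its own: pickle.py's save_string writes
the BINSTRING length with pack("<i", n), so any single byte string
longer than 2**31 - 1 bytes overflows that field (just a minimal
illustration of [2], not our actual code):

  import struct

  n = 2 ** 31  # one byte more than the signed 32-bit length field can hold
  struct.pack("<i", n)
  # struct.error: 'i' format requires -2147483648 <= number <= 2147483647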

Thanks!

[1] -- https://github.com/mblondel/scikit-learn/tree/kernel_ridge
[2] -- http://bugs.python.org/issue8651
