Hello List,

I am working with relatively humongous binary files (created via cPickle), and I stumbled across some unexpected (for me) performance differences between two approaches I use to load those files:

1. Simply use cPickle.load(fid)
2. Read the file as binary using file.read() and then use cPickle.loads on the resulting string

In the snippet below, the MakePickle function is a dummy function that generates a relatively big binary file with cPickle (WARNING: around 3 GB) in the current directory. I am using NumPy arrays to make the file big, but my original data structure is much more complicated, and things like HDF5 or databases are currently not an option. I'd like to stay with pickles.

The ReadPickle function simply calls cPickle.load(fid) on the opened binary file, and on my PC it takes about 2.3 seconds (approach 1).

The ReadPlusLoads function reads the file using file.read() and then calls cPickle.loads on the resulting string (approach 2). On my PC, the file.read() call takes 15 seconds (!!!) and cPickle.loads only 1.5 seconds.

What baffles me is the time it takes to read the file using file.read(): is there any way to slurp it all in one go (somehow) into a string ready for cPickle.loads without that much overhead?

Note that all of this has been done on Windows 7 64-bit with Python 2.7 64-bit, on a machine with 16 cores and 100 GB of RAM (so memory should not be a problem).

Thank you in advance for all suggestions :-) .

Andrea.
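
P.S. For anyone who would rather not create a 3 GB file just to look at this, here is a scaled-down sketch of the two approaches described above. The helper names (make_small_pickle, load_direct, read_then_loads, chunked_read_then_loads, run_comparison) are made up for this sketch, and a few MB of random bytes stand in for the NumPy arrays. The chunked variant is only an idea to try, not something I have verified to help, and timings on a file this small may well not reproduce the gap I see on the big one.

```python
import os
import tempfile
import time

try:
    import cPickle as pickle  # Python 2, as in my setup
except ImportError:
    import pickle  # Python 3 fallback so the sketch runs anywhere


def make_small_pickle(path, num_items=100, item_size=50000):
    """Write a small stand-in for dummy.pkl (about 5 MB, not 3 GB)."""
    data = [('dummy_%d' % i, os.urandom(item_size)) for i in range(num_items)]
    with open(path, 'wb') as fid:
        pickle.dump(data, fid, pickle.HIGHEST_PROTOCOL)
    return data


def load_direct(path):
    """Approach 1: hand the open file object to pickle.load."""
    with open(path, 'rb') as fid:
        return pickle.load(fid)


def read_then_loads(path):
    """Approach 2: slurp everything with file.read(), then pickle.loads."""
    with open(path, 'rb') as fid:
        raw = fid.read()
    return pickle.loads(raw)


def chunked_read_then_loads(path, chunk_size=64 * 1024 * 1024):
    """Idea to try (assumption): read fixed-size chunks and join them once."""
    chunks = []
    with open(path, 'rb') as fid:
        while True:
            chunk = fid.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return pickle.loads(b''.join(chunks))


def timed(label, func, path):
    """Run func(path), print the elapsed wall-clock time, return the result."""
    start = time.time()
    result = func(path)
    print('%s: %.3f s' % (label, time.time() - start))
    return result


def run_comparison():
    """Build a temporary pickle, load it three ways, and clean up."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, 'small_dummy.pkl')
    try:
        expected = make_small_pickle(path)
        results = [
            timed('pickle.load(fid)', load_direct, path),
            timed('file.read() + loads', read_then_loads, path),
            timed('chunked read + loads', chunked_read_then_loads, path),
        ]
    finally:
        if os.path.exists(path):
            os.remove(path)
        os.rmdir(tmp_dir)
    return expected, results


if __name__ == '__main__':
    run_comparison()
```

All three functions should return the same list, so any difference between them is purely in how the bytes get from disk into memory. Raising num_items or item_size brings the sketch closer to the original file size.
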
# Begin code

import os
import time
import cPickle

import numpy


class Dummy(object):
    """Stand-in for my real (much more complicated) data structure."""

    def __init__(self, name):
        self.name = name
        self.data = numpy.random.rand(200, 600, 10)


def MakePickle():
    """Create dummy.pkl in the current directory (WARNING: around 3 GB)."""
    num_objects = 300
    list_of_objects = []

    for index in xrange(num_objects):
        dummy = Dummy('dummy_%d' % index)
        list_of_objects.append(dummy)

    fid = open('dummy.pkl', 'wb')

    # Time the serialization and the disk write separately
    start = time.time()
    out = cPickle.dumps(list_of_objects, cPickle.HIGHEST_PROTOCOL)
    end = time.time()
    print 'cPickle.dumps time:', end-start

    start = end
    fid.write(out)
    end = time.time()
    print 'file.write time:', end-start
    fid.close()


def ReadPickle():
    """Approach 1: let cPickle.load read directly from the open file."""
    fid = open('dummy.pkl', 'rb')

    start = time.time()
    out = cPickle.load(fid)
    end = time.time()
    print 'cPickle.load time:', end-start
    fid.close()


def ReadPlusLoads():
    """Approach 2: read the whole file into a string, then cPickle.loads."""
    start = time.time()
    fid = open('dummy.pkl', 'rb')
    strs = fid.read()
    fid.close()
    end = time.time()
    print 'file.read time:', end-start

    start = end
    out = cPickle.loads(strs)
    end = time.time()
    print 'cPickle.loads time:', end-start


if __name__ == '__main__':
    # Create the test file on the first run only (it is around 3 GB)
    if not os.path.exists('dummy.pkl'):
        MakePickle()

    ReadPickle()
    ReadPlusLoads()

# End code

--
https://mail.python.org/mailman/listinfo/python-list