Hi list,

This message is not terribly informative. I just want to share my current successes with joblib compression.

I am a bit frustrated that the LFW cache takes 400M on my disk, for something that I have never used. The disk space in the LFW cache has two major contributors:

* The 'lfw_funneled' directory, with the jpeg images: 289M
* The joblib directory, used to store the precomputed extraction of the images: 197M

I spent quite a while trying to play tricks in the code to load and compress intermediate data structures, in order to avoid having thousands of jpegs stored in the lfw_funneled directory for no good reason, but couldn't really find a good compromise. The best I can get is to use tar followed by bzip2, which gets me down to 231M.

I tried my current development version of joblib, with compression activated. This brings the size of the joblib directory down to 79M. Of course there is a price to pay in speed:

* With compressed joblib:

    In [2]: %timeit d = datasets.fetch_lfw_people()
    1 loops, best of 3: 2.49 s per loop

    In [3]: %timeit d = datasets.fetch_lfw_pairs()
    1 loops, best of 3: 822 ms per loop

* With joblib and no compression:

    In [2]: %timeit d = datasets.fetch_lfw_people()
    100 loops, best of 3: 2.64 ms per loop

    In [3]: %timeit d = datasets.fetch_lfw_pairs()
    100 loops, best of 3: 3.44 ms per loop

* Without joblib caching:

    In [2]: %timeit d = datasets.fetch_lfw_people()
    1 loops, best of 3: 84.9 s per loop

    In [3]: %timeit d = datasets.fetch_lfw_pairs()
    1 loops, best of 3: 26.1 s per loop

I think that the new joblib has a useful compression/speed tradeoff :) I need to iron it out a bit more, release it, and then we can systematically use it in the dataset loaders (note that it will not beat domain-specific compressed data standards, e.g. for images or music).

Gael
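
To make the size/speed tradeoff described above concrete, here is a minimal sketch. It is not joblib's implementation: it uses the standard library's pickle and zlib as a stand-in for compressed persistence, and the helper names (save, load, compare) as well as the fake_images data are hypothetical, made up for illustration only.

```python
import os
import pickle
import tempfile
import time
import zlib


def save(obj, path, compress=False):
    """Pickle obj to path, optionally zlib-compressing the bytes.

    Returns the size of the file written, in bytes.
    """
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    if compress:
        payload = zlib.compress(payload, 3)  # moderate compression level
    with open(path, "wb") as f:
        f.write(payload)
    return os.path.getsize(path)


def load(path, compress=False):
    """Read back an object written by save() with the same compress flag."""
    with open(path, "rb") as f:
        payload = f.read()
    if compress:
        payload = zlib.decompress(payload)
    return pickle.loads(payload)


def compare(obj):
    """Return {label: (size_in_bytes, load_seconds)} for both storage modes."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for label, compress in (("raw", False), ("compressed", True)):
            path = os.path.join(tmp, label + ".pkl")
            size = save(obj, path, compress)
            start = time.perf_counter()
            restored = load(path, compress)
            elapsed = time.perf_counter() - start
            assert restored == obj  # the round trip must be lossless
            results[label] = (size, elapsed)
    return results


# Hypothetical stand-in data (not LFW): repetitive rows of byte values.
fake_images = [bytes((x + y) % 256 for x in range(64)) * 16 for y in range(200)]

results = compare(fake_images)
for label, (size, seconds) in results.items():
    print(f"{label}: {size} bytes, loaded in {seconds * 1e3:.2f} ms")
```

On data this repetitive the compressed file should be much smaller, while loading pays the extra decompression step. Real image arrays compress less well, so the actual ratio and timings depend on the data.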
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
