Hi Simone,

I have had a little experience with HDF5 and am interested to see where you go with this. I wonder if you could use "feather": https://github.com/wesm/feather
There was a recent post from Wes McKinney about feather, which sparked my interest: http://wesmckinney.com/blog/high-perf-arrow-to-pandas/

Do you use HDF5 to store intermediates? If so, I would try storing the intermediates in a file format like feather and then reducing them to a single HDF5 file at the end. The reduction should be IO-bound rather than RAM-bound, so it would suit your cluster.

If you need to read a large array, then I think HDF5 supports that (single writer, multiple readers) without the need for MPI - so this could map well to a tool like distributed: http://distributed.readthedocs.io/en/latest/

Not sure this helps; there is an assumption (on my part) that your intermediate calculations are not terabytes in size.

Good luck!

Nathan

On 29 December 2016 at 05:07, simone codeluppi <simone.codelu...@gmail.com> wrote:
> Hi all!
>
> I would like to pick your brain for some suggestions on how to modify my
> image analysis pipeline.
>
> I am analyzing terabytes of image stacks generated using a microscope. The
> current code I wrote relies heavily on scikit-image, numpy and scipy. In
> order to speed up the analysis, the code runs on an HPC computer
> (https://www.nsc.liu.se/systems/triolith/) with MPI (mpi4py) for
> parallelization and HDF5 (h5py) for file storage. The development cycle of
> the code has been pretty painful, mainly due to my unfamiliarity with MPI
> and to problems compiling parallel HDF5 (with many open/closing bugs).
> However, the big drawback is that each core has only 2 GB of RAM (no shared
> RAM across nodes), and in order to run some of the processing steps I ended
> up reserving one node (16 cores) but running only 3 cores in order to have
> enough RAM (image chunking won't work in this case). As you can imagine,
> this is extremely inefficient, and I end up getting low priority in the
> queue system.
>
> Our lab recently bought a new 4-node server with shared RAM running
> hadoop.
> My goal is to move the parallelization of the processing to dask. I
> tested it before on another system and it works great. The drawback is
> that, if I understood correctly, parallel HDF5 works only with MPI
> (driver='mpio'). HDF5 gave me quite a bit of headache, but it works well
> for keeping the data well structured, and I can save everything as numpy
> arrays... very handy.
>
> If I move to hadoop/dask, what do you think would be a good solution
> for data storage? Do you have any additional suggestions that could
> improve the layout of the pipeline? Any help will be greatly appreciated.
>
> Simone
> --
> *Bad as he is, the Devil may be abus'd,*
> *Be falsy charg'd, and causelesly accus'd,*
> *When men, unwilling to be blam'd alone,*
> *Shift off these Crimes on Him which are their*
> *Own*
>
> *Daniel Defoe*
>
> simone.codelu...@gmail.com
>
> sim...@codeluppi.org
>
> _______________________________________________
> scikit-image mailing list
> scikit-image@python.org
> https://mail.python.org/mailman/listinfo/scikit-image
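P.S. To make the "reduce to HDF5 at the end" step concrete, here is a rough sketch of a single-writer reduction with plain h5py - no parallel-HDF5/MPI build needed. The file names and dataset layout are invented for illustration (it fakes a few numpy intermediates so it runs end to end; with feather intermediates you would swap `np.load` for a feather read):

```python
import glob

import h5py
import numpy as np

# Fake a few per-chunk intermediates so the sketch runs end to end;
# in the real pipeline the workers would have written these already.
for i in range(3):
    np.save("intermediate_%04d.npy" % i, np.random.rand(64, 64))

# Single-writer reduction: one process gathers all intermediates into
# one HDF5 file, loading a single chunk at a time so peak RAM stays low.
with h5py.File("results.h5", "w") as out:
    for i, path in enumerate(sorted(glob.glob("intermediate_*.npy"))):
        arr = np.load(path)
        out.create_dataset("chunk_%04d" % i, data=arr, compression="gzip")
```

Because only one process ever writes, this sidesteps the mpio driver entirely; the workers just need some shared filesystem to drop their intermediates on.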