Hi Simone,

Unfortunately I cannot contribute anything to your questions, but would you
mind sharing some of your experience with the HDF5 file format and
parallelization? I am also analysing a lot of big microscopy images and
would love to hear your opinion or get advice on an image pipeline.
Best regards,
Stefanie

2016-12-28 23:37 GMT+01:00 Nathan Faggian <nathan.fagg...@gmail.com>:

> Hi Simone,
>
> I have had a little experience with HDF5 and am interested to see where
> you go with this. I wonder if you could use "feather":
> https://github.com/wesm/feather
>
> There was a recent post from Wes McKinney about feather, which sparked my
> interest:
> http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
>
> Do you use HDF5 to store intermediates? If so, I would try storing
> intermediates in a format like feather and then reducing to a single HDF5
> file at the end. The reduction should be IO bound rather than RAM bound,
> so it would suit your cluster.
>
> If you need to read a large array, HDF5 supports that (single writer,
> multiple readers) without the need for MPI - so this could map well to a
> tool like distributed:
> http://distributed.readthedocs.io/en/latest/
>
> Not sure this helps; there is an assumption (on my part) that your
> intermediate calculations are not terabytes in size.
>
> Good luck!
>
> Nathan
>
> On 29 December 2016 at 05:07, simone codeluppi <simone.codelu...@gmail.com>
> wrote:
>
>> Hi all!
>>
>> I would like to pick your brains for suggestions on how to modify my
>> image analysis pipeline.
>>
>> I am analyzing terabytes of image stacks generated using a microscope.
>> The code I have written relies heavily on scikit-image, numpy and scipy.
>> To speed up the analysis, the code runs on an HPC computer
>> (https://www.nsc.liu.se/systems/triolith/) with MPI (mpi4py) for
>> parallelization and HDF5 (h5py) for file storage. The development cycle
>> of the code has been pretty painful, mainly due to my unfamiliarity with
>> MPI and problems compiling parallel HDF5 (with many open/closed bugs).
>> The big drawback, however, is that each core has only 2 GB of RAM (no
>> shared RAM across nodes), so to run some of the processing steps I ended
>> up reserving a whole node (16 cores) but using only 3 cores in order to
>> have enough RAM (image chunking won't work in this case). As you can
>> imagine, this is extremely inefficient, and I end up getting low
>> priority in the queue system.
>>
>> Our lab recently bought a new four-node server with shared RAM running
>> hadoop. My goal is to move the parallelization of the processing to
>> dask. I have tested dask on another system before and it works great.
>> The drawback is that, if I understand correctly, parallel HDF5 works
>> only with MPI (driver='mpio'). HDF5 gave me quite a bit of headache, but
>> it works well for keeping the data well structured, and I can save
>> everything as numpy arrays - very handy.
>>
>> If I move to hadoop/dask, what do you think would be a good solution for
>> data storage? Do you have any additional suggestions that could improve
>> the layout of the pipeline? Any help will be greatly appreciated.
>>
>> Simone
>> --
>> *Bad as he is, the Devil may be abus'd,*
>> *Be falsly charg'd, and causelesly accus'd,*
>> *When men, unwilling to be blam'd alone,*
>> *Shift off these Crimes on Him which are their Own*
>>
>> *Daniel Defoe*
>>
>> simone.codelu...@gmail.com
>> sim...@codeluppi.org
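A minimal sketch of the pattern Nathan describes - each worker writes its
intermediate result as a feather file, and a single process later reduces
them into one HDF5 file. The file names, the tabular shape of the
intermediates, and the per-chunk dataset layout are illustrative
assumptions, not anyone's actual pipeline:

# Hypothetical sketch: per-worker feather intermediates, reduced to HDF5.
import glob

import feather  # https://github.com/wesm/feather (feather-format package)
import h5py
import pandas as pd

def write_intermediate(df, worker_id):
    # Each worker dumps its own result as a small feather file.
    feather.write_dataframe(df, "intermediate_%04d.feather" % worker_id)

def reduce_to_hdf5(pattern, out_path):
    # One process streams the intermediates back and packs them into a
    # single HDF5 file - IO bound, with one chunk in RAM at a time.
    with h5py.File(out_path, "w") as out:
        for i, path in enumerate(sorted(glob.glob(pattern))):
            df = feather.read_dataframe(path)
            out.create_dataset("chunk_%04d" % i, data=df.values)

reduce_to_hdf5("intermediate_*.feather", "results.h5")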
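And a sketch of the read side Nathan mentions - wrapping a large HDF5
dataset in a dask array so each worker holds only one chunk in RAM, which
maps well to distributed and to 2 GB-per-core nodes. The dataset name
"stack", the chunk sizes, and the Gaussian filter step are placeholders:

# Hypothetical sketch: chunked reads from HDF5 via dask, no MPI needed.
import dask.array as da
import h5py
from skimage import filters

f = h5py.File("images.h5", "r")  # read-only; a single writer elsewhere is fine
dset = f["stack"]                # e.g. shape (n_images, y, x)

# Wrap the on-disk dataset; only one chunk per worker lives in RAM at a time.
stack = da.from_array(dset, chunks=(1, 2048, 2048))

# Apply a scikit-image filter independently to each chunk (in-plane only).
smoothed = stack.map_blocks(filters.gaussian, sigma=(0, 2, 2), dtype="float64")

# With distributed, point a Client at the scheduler before computing:
# from distributed import Client; client = Client("scheduler-host:8786")
result = smoothed.compute()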
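For comparison, the driver='mpio' pattern Simone mentions looks roughly
like this, assuming h5py was compiled against an MPI-enabled libhdf5 (the
build step that caused the headaches); the file name and shapes are
illustrative:

# Hypothetical sketch: each MPI rank writes its own slice of one HDF5 file.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD

with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    # All ranks must create the dataset collectively, with the same shape.
    dset = f.create_dataset("data", shape=(comm.size, 1024), dtype="f8")
    dset[comm.rank] = np.random.random(1024)  # independent per-rank write

# Run with e.g.: mpiexec -n 4 python write_parallel.py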
_______________________________________________
scikit-image mailing list
scikit-image@python.org
https://mail.python.org/mailman/listinfo/scikit-image