Something I have been playing with recently that might address your problem in a way you like is SquashFS:
http://squashfs.sourceforge.net/

SquashFS is a read-only compressed file system. It uses gzip --best, which in my experience is comparable to bzip2 for diffraction images. Basically, it works a lot like burning to a CD: you run "mksquashfs" to create the compressed image and then "mount -o loop" it. Then voila! You can access everything in the archive as if it were an uncompressed file. Disk I/O then consists of compressed data (decompression is done by the kernel), and so does network traffic if you play a clever trick: share the compressed file over NFS and "mount -o loop" it locally.

This has much bigger advantages than you might realize, because most of the NFS traffic that brings a file server to its knees is the tiny little "writes" done to update access times. NFS writes (and RAID writes) are all really expensive, and you can gain a considerable performance increase just by mounting your "data" disks read-only (or by putting "noatime" in the mount options).
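
For concreteness, a minimal sketch of the whole trick (the paths and host names here are all made up):

    # build the compressed image from a directory of raw images
    mksquashfs /data/collect1 collect1.sqsh

    # mount it locally; everything inside appears as ordinary files
    mount -o loop collect1.sqsh /mnt/collect1

    # the NFS version: export the .sqsh file, then on each client
    mount -t nfs -o ro,noatime fileserver:/export /net/export
    mount -o loop /net/export/collect1.sqsh /mnt/collect1

Only compressed bytes ever cross the wire, and each client's kernel does its own decompression.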

Anyway, SquashFS is not as slick as the transparent compression you can get with HFS or NTFS, but I personally like the fact that it is read-only (good for data). For real-time backup, mksquashfs does support "appending" to an existing archive, so you can probably build your squashfs file on the USB disk at the beamline (even if the beamline computer kernels can't mount it). However, if you MUST have your processing files mixed in amongst your images, you can use "unionfs" to overlay a writable file system on top of the read-only one. Depends on how cooperative your IT guys are...
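
Again just a sketch (names invented; the second line assumes your kernel has the out-of-tree unionfs module):

    # "appending": if collect1.sqsh already exists, mksquashfs adds to it
    mksquashfs /data/collect2 collect1.sqsh

    # overlay a writable scratch directory on top of the read-only mount
    mount -t unionfs -o dirs=/scratch/work=rw:/mnt/collect1=ro unionfs /mnt/union

Processing files then land in /scratch/work while the images underneath stay untouched.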

-James Holton
MAD Scientist

Ian Tickle wrote:
All -

No doubt this topic has come up before on the BB: I'd like to ask
about the current capabilities of the various integration programs (in
practice we use only MOSFLM & XDS) for reading compressed diffraction
images from synchrotrons.  AFAICS XDS has limited support for reading
compressed images (TIFF format from the MARCCD detector and CCP4
compressed format from the Oxford Diffraction CCD); MOSFLM doesn't
seem to support reading compressed images at all (I'm sure Harry will
correct me if I'm wrong about this!).  I'm really thinking about
gzipped files here: bzip2 no doubt gives marginally smaller files but
is very slow.  Currently we bring back uncompressed images, but it
seems to me that this is not the most efficient way of doing things -
or is it just that my expectation that it's more efficient to read
compressed images and uncompress them in memory is not realised in
practice?
For example, the AstexViewer molecular viewer currently reads gzipped
CCP4 maps directly and gunzips them in memory; this improves the
response time by a modest factor of ~1.5, but that is because electron
density maps are 'dense' from a compression point of view.  X-ray
diffraction images tend to have much more 'empty space', so the
compression factor is usually considerably higher (as much as 10-fold)
and the potential speed-up correspondingly larger.
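
To illustrate what I mean (file name invented): from the shell you
can keep the uncompressed bytes off the disk entirely, either by
decompressing into a RAM-backed tmpfs or by piping:

    # only the compressed bytes are read from disk; gunzip runs in memory
    zcat image_001.img.gz > /dev/shm/image_001.img

    # or stream straight into anything that reads from stdin
    zcat image_001.img.gz | wc -c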

On a recent trip we collected more data than we anticipated & the
uncompressed data no longer fitted on our USB disk (the data is backed
up to the USB disk as it's collected), so we would have definitely
benefited from compression!  However file size is *not* the issue:
disk space is cheap after all.  My point is that compressed images
surely require much less disk I/O to read.  In this respect bringing
back compressed images and then uncompressing back to a local disk
completely defeats the object of compression - you actually more than
double the I/O instead of reducing it!  We see this when we try to
process the ~150 datasets that we bring back on our PC cluster: the
disk I/O completely cripples the disk server machine (and everyone
who's trying to use it at the same time!) unless we're careful to
limit the number of simultaneous jobs.  When we routinely start to use
the Pilatus detector on the beamlines this is going to be even more of
an issue.  Basically we have plenty of processing power from the
cluster: the disk I/O is the bottleneck.  Now you could argue that we
should spread the load over more disks or maybe spend more on faster
disk controllers, but the whole point about disks is they're cheap, we
don't need the extra I/O bandwidth for anything else, and you
shouldn't need to spend a fortune, particularly if there are ways of
making the software more efficient, which after all will benefit
everyone.

Cheers

-- Ian
