Something I have been playing with recently that might address your problem in a way you like is SquashFS:
http://squashfs.sourceforge.net/

SquashFS is a read-only compressed file system. It uses gzip --best, which in my experience is comparable to bzip2 for diffraction images. Basically, it works a lot like burning to a CD: you run "mksquashfs" to create the compressed image and then "mount -o loop" it. Then voila! You can access everything in the archive as if it were an uncompressed file. Disk I/O then consists of compressed data (decompression is done by the kernel), and so does network traffic if you play a clever trick: share the compressed file over NFS and "mount -o loop" it locally.

This has much bigger advantages than you might realize, because most of the NFS traffic that brings a file server to its knees is the tiny little "writes" done to update access times. NFS writes (and RAID writes) are all really expensive, and you can gain a considerable performance increase just by mounting your "data" disks read-only (or by putting "noatime" in the mount options).
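
For concreteness, a minimal sketch of the whole trick (the paths and host names here are all made up):

    # build the compressed image from a directory of raw images
    mksquashfs /data/collect1 collect1.sqsh

    # mount it locally; everything inside appears as ordinary files
    mount -o loop collect1.sqsh /mnt/collect1

    # the NFS version: export the .sqsh file, then on each client
    mount -t nfs -o ro,noatime fileserver:/export /net/export
    mount -o loop /net/export/collect1.sqsh /mnt/collect1

Only compressed bytes ever cross the wire, and each client's kernel does its own decompression.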

Anyway, SquashFS is not as slick as the transparent compression you can get with HFS or NTFS, but I personally like the fact that it is read-only (good for data). For real-time backup, mksquashfs does support "appending" to an existing archive, so you can probably build your squashfs file on the USB disk at the beamline (even if the beamline computer kernels can't mount it). However, if you MUST have your processing files mixed in amongst your images, you can use "unionfs" to overlay a writable file system on top of the read-only one. Depends on how cooperative your IT guys are...
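
Again just a sketch (names invented; the second line assumes your kernel has the out-of-tree unionfs module):

    # "appending": if collect1.sqsh already exists, mksquashfs adds to it
    mksquashfs /data/collect2 collect1.sqsh

    # overlay a writable scratch directory on top of the read-only mount
    mount -t unionfs -o dirs=/scratch/work=rw:/mnt/collect1=ro unionfs /mnt/union

Processing files then land in /scratch/work while the images underneath stay untouched.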

-James Holton
MAD Scientist

Ian Tickle wrote:
All -

No doubt this topic has come up before on the BB: I'd like to ask
about the current capabilities of the various integration programs (in
practice we use only MOSFLM & XDS) for reading compressed diffraction
images from synchrotrons.  AFAICS XDS has limited support for reading
compressed images (TIFF format from the MARCCD detector and CCP4
compressed format from the Oxford Diffraction CCD); MOSFLM doesn't
seem to support reading compressed images at all (I'm sure Harry will
correct me if I'm wrong about this!).  I'm really thinking about
gzipped files here: bzip2 no doubt gives marginally smaller files but
is very slow.  Currently we bring back uncompressed images, but it
seems to me that this is not the most efficient way of doing things -
or is it just that my expectation that it's more efficient to read
compressed images and uncompress them in memory is not realised in
practice?
For example, the AstexViewer molecular viewer currently reads gzipped
CCP4 maps directly and gunzips them in memory; this improves the
response time by a modest factor of ~1.5, but that is because electron
density maps are 'dense' from a compression point of view.  X-ray
diffraction images tend to have much more 'empty space', so the
compression factor is usually considerably higher (as much as 10-fold)
and the potential speed-up correspondingly larger.
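
To illustrate what I mean (file name invented): from the shell you
can keep the uncompressed bytes off the disk entirely, either by
decompressing into a RAM-backed tmpfs or by piping:

    # only the compressed bytes are read from disk; gunzip runs in memory
    zcat image_001.img.gz > /dev/shm/image_001.img

    # or stream straight into anything that reads from stdin
    zcat image_001.img.gz | wc -c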

On a recent trip we collected more data than we anticipated & the
uncompressed data no longer fitted on our USB disk (the data is backed
up to the USB disk as it's collected), so we would have definitely
benefited from compression!  However file size is *not* the issue:
disk space is cheap after all.  My point is that compressed images
surely require much less disk I/O to read.  In this respect bringing
back compressed images and then uncompressing back to a local disk
completely defeats the object of compression - you actually more than
double the I/O instead of reducing it!  We see this when we try to
process the ~150 datasets that we bring back on our PC cluster: the
disk I/O completely cripples the disk server machine (and everyone
who's trying to use it at the same time!) unless we're careful to
limit the number of simultaneous jobs.  When we routinely start to use
the Pilatus detector on the beamlines this is going to be even more of
an issue.  Basically we have plenty of processing power from the
cluster: the disk I/O is the bottleneck.  Now you could argue that we
should spread the load over more disks or maybe spend more on faster
disk controllers, but the whole point about disks is they're cheap, we
don't need the extra I/O bandwidth for anything else, and you
shouldn't need to spend a fortune, particularly if there are ways of
making the software more efficient, which after all will benefit
everyone.

Cheers

-- Ian
