I think that truly universal image deposition will not take off without a new type of compression that speeds things up and makes them easier. The compression discussion is therefore highly relevant - I would even suggest going to mathematicians and software engineers to provide a highly efficient compression format for our type of data. Our data sets have some very typical repetitive features, so an entire series can very likely be compressed without losing information (differential compression within the series) - but this needs experts.
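The "differential compression within the series" idea can be sketched in a few lines: successive frames in a rotation series are highly correlated, so storing each frame as its difference from the previous one and losslessly compressing the (much flatter) deltas can shrink a whole series far more than compressing each frame independently. A minimal Python sketch, assuming 2-D integer frames and using zlib purely as a placeholder for whatever entropy coder the experts would actually choose:

```python
# Sketch of lossless differential (delta) compression of an image series.
# Successive frames differ only slightly, so the frame-to-frame deltas
# contain mostly small numbers and compress much better than raw frames.
# zlib stands in here for a real, tuned entropy coder.
import zlib
import numpy as np

def compress_series(frames):
    """Delta-encode a list of 2-D int32 frames, then deflate each delta."""
    prev = np.zeros_like(frames[0])
    packets = []
    for frame in frames:
        delta = frame.astype(np.int32) - prev.astype(np.int32)
        packets.append(zlib.compress(delta.tobytes(), level=9))
        prev = frame
    return packets

def decompress_series(packets, shape):
    """Invert compress_series: inflate each packet and accumulate deltas."""
    prev = np.zeros(shape, dtype=np.int32)
    frames = []
    for p in packets:
        delta = np.frombuffer(zlib.decompress(p), dtype=np.int32).reshape(shape)
        prev = prev + delta
        frames.append(prev)
    return frames
```

The round trip is exactly lossless, which is the point: the compression gain comes from exploiting redundancy across the series, not from discarding information.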
Jan Dohnalek

On Tue, Nov 8, 2011 at 8:19 AM, Miguel Ortiz Lombardia
<[email protected]> wrote:

> So the purists of speed seem to be more relevant than the purists of
> images.
>
> We complain all the time about how many errors we have out there in our
> experiments that we seemingly cannot account for. Yet, would we add
> another source?
>
> Sorry if I'm missing something serious here, but I cannot understand
> this artificial debate. You can do useful remote data collection without
> having to look at *each* image.
>
> Miguel
>
> On 08/11/2011 06:27, Frank von Delft wrote:
> > I'll second that... can't remember anybody on the barricades about
> > "corrected" CCD images, but they've been just so much more practical.
> >
> > Different kind of problem, I know, but equivalent situation: the people
> > to ask are not the purists, but the ones struggling with the huge
> > volumes of data. I'll take the lossy version any day if it speeds up
> > real-time evaluation of data quality, helps me browse my datasets, and
> > allows me to do remote but intelligent data collection.
> >
> > phx.
> >
> > On 08/11/2011 02:22, Herbert J. Bernstein wrote:
> >> Dear James,
> >>
> >> You are _not_ wasting your time. Even if the lossy compression ends
> >> up only being used to stage preliminary images forward on the net
> >> while full images slowly work their way forward, having such a
> >> compression that preserves the crystallography in the image will be
> >> an important contribution to efficient workflows. Personally I
> >> suspect that such images will have more important uses, e.g.
> >> facilitating real-time monitoring of experiments using detectors
> >> providing full images at data rates that simply cannot be handled
> >> without major compression. We are already in that world.
> >> The reason that the Dectris images use Andy Hammersley's byte-offset
> >> compression, rather than going uncompressed or using CCP4
> >> compression, is that in January 2007 we were sitting right on the
> >> edge of a nasty CPU-performance/disk-bandwidth tradeoff, and the
> >> byte-offset compression won the competition. In that round a
> >> lossless compression was sufficient, but just barely. In the future,
> >> I am certain some amount of lossy compression will be needed to
> >> sample the dataflow while the losslessly compressed images work
> >> their way through a very back-logged queue to the disk.
> >>
> >> In the longer term, I can see people working with lossy compressed
> >> images for analysis of massive volumes of images to select the
> >> 1% to 10% that will be useful in a final analysis, and may need
> >> to be used in a lossless mode. If you can reject 90% of the images
> >> with a fraction of the effort needed to work with the resulting
> >> 10% of good images, you have made a good decision.
> >>
> >> And then there is the inevitable need to work with images on
> >> portable devices with limited storage over cell and WiFi networks. ...
> >>
> >> I would not worry about upturned noses. I would worry about
> >> the engineering needed to manage experiments. Lossy compression
> >> can be an important part of that engineering.
> >>
> >> Regards,
> >> Herbert
> >>
> >> At 4:09 PM -0800 11/7/11, James Holton wrote:
> >>> So far, all I really have is a "proof of concept" compression
> >>> algorithm here:
> >>> http://bl831.als.lbl.gov/~jamesh/lossy_compression/
> >>>
> >>> Not exactly "portable", since you need ffmpeg and the x264 libraries
> >>> set up properly. The latter seems to be constantly changing things
> >>> and breaking the former, so I'm not sure how "future proof" my
> >>> "algorithm" is.
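The byte-offset scheme Herbert refers to can be sketched compactly: each pixel is stored as its signed difference from the previous pixel, costing a single byte when the delta is small and escaping to wider integers only when it is not. Since detector backgrounds are smooth, most deltas are tiny and most pixels cost one byte. A simplified Python sketch (no 64-bit escape stage; CBFlib implements the real codec):

```python
# Simplified sketch of CBF-style byte-offset compression: deltas between
# consecutive pixels, 1 byte each, with 0x80 as an escape to int16 and
# the int16 sentinel -32768 as a further escape to int32.
import struct

def byte_offset_encode(values):
    out = bytearray()
    prev = 0
    for v in values:
        delta = v - prev
        if -127 <= delta <= 127:
            out += struct.pack("<b", delta)            # 1-byte delta
        elif -32767 <= delta <= 32767:
            out += b"\x80" + struct.pack("<h", delta)  # escape + int16
        else:
            out += b"\x80" + struct.pack("<h", -32768) \
                 + struct.pack("<i", delta)            # escape + int32
        prev = v
    return bytes(out)

def byte_offset_decode(data):
    values, prev, i = [], 0, 0
    while i < len(data):
        (delta,) = struct.unpack_from("<b", data, i); i += 1
        if delta == -128:                              # int16 escape
            (delta,) = struct.unpack_from("<h", data, i); i += 2
            if delta == -32768:                        # int32 escape
                (delta,) = struct.unpack_from("<i", data, i); i += 4
        prev += delta
        values.append(prev)
    return values
```

On a typical smooth background the encoded stream approaches 1 byte per pixel versus 4 for raw int32, while staying fully lossless and cheap enough to keep up with the detector - which is the CPU/bandwidth tradeoff described above.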
> >>> Something that caught my eye recently was fractal compression,
> >>> particularly since FIASCO has been part of the NetPBM package for
> >>> about 10 years now. It seems to give comparable compression vs.
> >>> quality to x264 (to my eye), but I'm presently wondering if I'd be
> >>> wasting my time developing this further. Will the crystallographic
> >>> world simply turn up its collective nose at lossy images? Even if
> >>> it means waiting 6 years for "Nielsen's Law" to make up the
> >>> difference in network bandwidth?
> >>>
> >>> -James Holton
> >>> MAD Scientist
> >>>
> >>> On Mon, Nov 7, 2011 at 10:01 AM, Herbert J. Bernstein
> >>> <[email protected]> wrote:
> >>>> This is a very good question. I would suggest that both versions
> >>>> of the old data are useful. If what is being done is simple
> >>>> validation and regeneration of what was done before, then the
> >>>> lossy compression should be fine in most instances. However, when
> >>>> what is being done hinges on the really fine details -- looking
> >>>> for lost faint spots just peeking out from the background, looking
> >>>> at detailed peak profiles -- then the lossless compression version
> >>>> is the better choice. The annotation for both sets should be the
> >>>> same. The difference is in storage and network bandwidth.
> >>>>
> >>>> Hopefully the fraud issue will never again rear its ugly head,
> >>>> but if it should, then having saved the losslessly compressed
> >>>> images might prove to have been a good idea.
> >>>>
> >>>> To facilitate experimentation with the idea, if there is agreement
> >>>> on the particular lossy compression to be used, I would be happy
> >>>> to add it as an option in CBFlib. Right now all the compressions
> >>>> we have are lossless.
> >>>>
> >>>> Regards,
> >>>> Herbert
> >>>>
> >>>> =====================================================
> >>>> Herbert J. Bernstein, Professor of Computer Science
> >>>> Dowling College, Kramer Science Center, KSC 121
> >>>> Idle Hour Blvd, Oakdale, NY, 11769
> >>>>
> >>>> +1-631-244-3035
> >>>> [email protected]
> >>>> =====================================================
> >>>>
> >>>> On Mon, 7 Nov 2011, James Holton wrote:
> >>>>
> >>>>> At the risk of sounding like another "poll", I have a pragmatic
> >>>>> question for the methods development community:
> >>>>>
> >>>>> Hypothetically, assume that there was a website where you could
> >>>>> download the original diffraction images corresponding to any
> >>>>> given PDB file, including "early" datasets that were from the same
> >>>>> project but, because of smeary spots or whatever, couldn't be
> >>>>> solved. There might even be datasets with "unknown" PDB IDs
> >>>>> because that particular project never did work out, or because the
> >>>>> relevant protein sequence has been lost. Remember, few of these
> >>>>> datasets will be less than 5 years old if we try to allow enough
> >>>>> time for the original data collector to either solve it or
> >>>>> graduate (and then cease to care). Even for the "final" dataset
> >>>>> there will be a delay, since the half-life between data collection
> >>>>> and coordinate deposition in the PDB is still ~20 months. Plenty
> >>>>> of time to forget. So, although the images were archived (probably
> >>>>> named "test" and in a directory called "john"), it may be that the
> >>>>> only way to figure out which PDB ID is the "right answer" is by
> >>>>> processing them and comparing to all deposited Fs. Assume this was
> >>>>> done. But there will always be some datasets that don't match any
> >>>>> PDB. Are those interesting? What about ones that can't be
> >>>>> processed? What about ones that can't even be indexed? There may
> >>>>> be a lot of those!
> >>>>> (hypothetically, of course).
> >>>>>
> >>>>> Anyway, assume that someone did go through all the trouble to make
> >>>>> these datasets "available" for download, just in case they are
> >>>>> interesting, and annotated them as much as possible. There will be
> >>>>> about 20 datasets for any given PDB ID.
> >>>>>
> >>>>> Now assume that for each of these datasets this hypothetical
> >>>>> website has two links: one for the "raw data", which will average
> >>>>> ~2 GB per wedge (after gzip compression, taking at least ~45 min
> >>>>> to download), and a second link for a "lossy compressed" version,
> >>>>> which is only ~100 MB/wedge (2 min download). When decompressed,
> >>>>> the images will visually look pretty much like the originals, and
> >>>>> generally give you very similar Rmerge, Rcryst, Rfree, I/sigma,
> >>>>> anomalous differences, and all other statistics when processed
> >>>>> with contemporary software. Perhaps a bit worse. Essentially,
> >>>>> lossy compression is equivalent to adding noise to the images.
> >>>>>
> >>>>> Which one would you try first? Does lossy compression make it
> >>>>> easier to hunt for "interesting" datasets? Or is it just too
> >>>>> repugnant to have "modified" the data in any way, shape or form
> >>>>> ... after the detector manufacturer's software has "corrected" it?
> >>>>> Would it suffice to simply supply a couple of "example" images for
> >>>>> download instead?
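The claim that "lossy compression is equivalent to adding noise" can be made semi-quantitative: as long as the RMS error introduced by the codec is small compared with the photon shot noise (roughly sqrt(counts) for Poisson-distributed data), the derived statistics should barely move. A toy Python sketch, using crude uniform quantization as a stand-in for a real lossy codec (x264 or FIASCO behave differently in detail):

```python
# Toy demonstration that a lossy step adding error well below the shot
# noise is statistically almost invisible. Uniform quantization stands in
# for a real lossy codec here; it is purely illustrative.
import numpy as np

def quantize(img, step):
    """Lossy 'compression': round pixel values to multiples of `step`."""
    return np.round(img / step) * step

rng = np.random.default_rng(42)
truth = np.full((256, 256), 400.0)        # mean photon count per pixel
img = rng.poisson(truth).astype(float)    # shot-noise-limited "raw" image

shot_noise = np.sqrt(truth.mean())        # ~20 counts RMS from Poisson stats
codec_err = quantize(img, 8) - img        # error added by the lossy step

# Quantization with step q adds roughly q/sqrt(12) RMS error (~2.3 counts
# here), which is negligible next to ~20 counts of shot noise.
print(codec_err.std(), shot_noise)
```

In this regime the "noise" added by compression is an order of magnitude below the noise already in the measurement, which is why the Rmerge/Rfree/I-over-sigma statistics come out "perhaps a bit worse" rather than visibly damaged.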
> >>>>>
> >>>>> -James Holton
> >>>>> MAD Scientist
>
> --
> Miguel
>
> Architecture et Fonction des Macromolécules Biologiques (UMR6098)
> CNRS, Universités d'Aix-Marseille I & II
> Case 932, 163 Avenue de Luminy, 13288 Marseille cedex 9, France
> Tel: +33(0) 491 82 55 93
> Fax: +33(0) 491 26 67 20
> mailto:[email protected]
> http://www.afmb.univ-mrs.fr/Miguel-Ortiz-Lombardia

--
Jan Dohnalek, Ph.D.
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic
Tel: +420 296 809 390
Fax: +420 296 809 410
