I think that truly universal image deposition will not take off without a new type of compression that speeds things up and makes them easier. The compression discussion is therefore highly relevant - I would even suggest going to mathematicians and software engineers to provide a highly efficient compression format for our type of data. Our data sets have some very typical repetitive features, so an entire series can very likely be compressed without losing information (differential compression within the series) - but this needs experts.
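The "differential compression within the series" idea can be sketched in a few lines: successive frames in a rotation series are highly correlated, so storing each frame as its difference from the previous one and losslessly compressing the (much flatter) deltas can shrink a whole series far more than compressing each frame independently. A minimal Python sketch, assuming 2-D integer frames and using zlib purely as a placeholder for whatever entropy coder the experts would actually choose:

```python
# Sketch of lossless differential (delta) compression of an image series.
# Successive frames differ only slightly, so the frame-to-frame deltas
# contain mostly small numbers and compress much better than raw frames.
# zlib stands in here for a real, tuned entropy coder.
import zlib
import numpy as np

def compress_series(frames):
    """Delta-encode a list of 2-D int32 frames, then deflate each delta."""
    prev = np.zeros_like(frames[0])
    packets = []
    for frame in frames:
        delta = frame.astype(np.int32) - prev.astype(np.int32)
        packets.append(zlib.compress(delta.tobytes(), level=9))
        prev = frame
    return packets

def decompress_series(packets, shape):
    """Invert compress_series: inflate each packet and accumulate deltas."""
    prev = np.zeros(shape, dtype=np.int32)
    frames = []
    for p in packets:
        delta = np.frombuffer(zlib.decompress(p), dtype=np.int32).reshape(shape)
        prev = prev + delta
        frames.append(prev)
    return frames
```

The round trip is exactly lossless, which is the point: the compression gain comes from exploiting redundancy across the series, not from discarding information.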
Jan Dohnalek

On Tue, Nov 8, 2011 at 8:19 AM, Miguel Ortiz Lombardia
<[email protected]> wrote:

> So the purists of speed seem to be more relevant than the purists of
> images.
>
> We complain all the time about how many errors we have out there in our
> experiments that we seemingly cannot account for. Yet, would we add
> another source?
>
> Sorry if I'm missing something serious here, but I cannot understand
> this artificial debate. You can do useful remote data collection without
> having to look at *each* image.
>
> Miguel
>
> On 08/11/2011 06:27, Frank von Delft wrote:
> > I'll second that... can't remember anybody on the barricades about
> > "corrected" CCD images, but they've been just so much more practical.
> >
> > Different kind of problem, I know, but equivalent situation: the people
> > to ask are not the purists, but the ones struggling with the huge
> > volumes of data. I'll take the lossy version any day if it speeds up
> > real-time evaluation of data quality, helps me browse my datasets, and
> > allows me to do remote but intelligent data collection.
> >
> > phx.
> >
> > On 08/11/2011 02:22, Herbert J. Bernstein wrote:
> >> Dear James,
> >>
> >> You are _not_ wasting your time. Even if the lossy compression ends
> >> up only being used to stage preliminary images forward on the net
> >> while full images slowly work their way forward, having such a
> >> compression that preserves the crystallography in the image will be
> >> an important contribution to efficient workflows. Personally I
> >> suspect that such images will have more important uses, e.g.
> >> facilitating real-time monitoring of experiments using detectors
> >> providing full images at data rates that simply cannot be handled
> >> without major compression. We are already in that world.
> >> The reason that the Dectris images use Andy Hammersley's byte-offset
> >> compression, rather than going uncompressed or using CCP4
> >> compression, is that in January 2007 we were sitting right on the
> >> edge of a nasty CPU-performance/disk-bandwidth tradeoff, and the
> >> byte-offset compression won the competition. In that round a
> >> lossless compression was sufficient, but just barely. In the future,
> >> I am certain some amount of lossy compression will be needed to
> >> sample the dataflow while the losslessly compressed images work
> >> their way through a very back-logged queue to the disk.
> >>
> >> In the longer term, I can see people working with lossy compressed
> >> images for analysis of massive volumes of images to select the
> >> 1% to 10% that will be useful in a final analysis, and may need
> >> to be used in a lossless mode. If you can reject 90% of the images
> >> with a fraction of the effort needed to work with the resulting
> >> 10% of good images, you have made a good decision.
> >>
> >> And then there is the inevitable need to work with images on
> >> portable devices with limited storage over cell and WiFi networks. ...
> >>
> >> I would not worry about upturned noses. I would worry about
> >> the engineering needed to manage experiments. Lossy compression
> >> can be an important part of that engineering.
> >>
> >> Regards,
> >> Herbert
> >>
> >> At 4:09 PM -0800 11/7/11, James Holton wrote:
> >>> So far, all I really have is a "proof of concept" compression
> >>> algorithm here:
> >>> http://bl831.als.lbl.gov/~jamesh/lossy_compression/
> >>>
> >>> Not exactly "portable", since you need ffmpeg and the x264 libraries
> >>> set up properly. The latter seems to be constantly changing things
> >>> and breaking the former, so I'm not sure how "future proof" my
> >>> "algorithm" is.
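The byte-offset scheme Herbert refers to can be sketched compactly: each pixel is stored as its signed difference from the previous pixel, costing a single byte when the delta is small and escaping to wider integers only when it is not. Since detector backgrounds are smooth, most deltas are tiny and most pixels cost one byte. A simplified Python sketch (no 64-bit escape stage; CBFlib implements the real codec):

```python
# Simplified sketch of CBF-style byte-offset compression: deltas between
# consecutive pixels, 1 byte each, with 0x80 as an escape to int16 and
# the int16 sentinel -32768 as a further escape to int32.
import struct

def byte_offset_encode(values):
    out = bytearray()
    prev = 0
    for v in values:
        delta = v - prev
        if -127 <= delta <= 127:
            out += struct.pack("<b", delta)            # 1-byte delta
        elif -32767 <= delta <= 32767:
            out += b"\x80" + struct.pack("<h", delta)  # escape + int16
        else:
            out += b"\x80" + struct.pack("<h", -32768) \
                 + struct.pack("<i", delta)            # escape + int32
        prev = v
    return bytes(out)

def byte_offset_decode(data):
    values, prev, i = [], 0, 0
    while i < len(data):
        (delta,) = struct.unpack_from("<b", data, i); i += 1
        if delta == -128:                              # int16 escape
            (delta,) = struct.unpack_from("<h", data, i); i += 2
            if delta == -32768:                        # int32 escape
                (delta,) = struct.unpack_from("<i", data, i); i += 4
        prev += delta
        values.append(prev)
    return values
```

On a typical smooth background the encoded stream approaches 1 byte per pixel versus 4 for raw int32, while staying fully lossless and cheap enough to keep up with the detector - which is the CPU/bandwidth tradeoff described above.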
> >>> Something that caught my eye recently was fractal compression,
> >>> particularly since FIASCO has been part of the NetPBM package for
> >>> about 10 years now. It seems to give comparable compression vs.
> >>> quality to x264 (to my eye), but I'm presently wondering if I'd be
> >>> wasting my time developing this further. Will the crystallographic
> >>> world simply turn up its collective nose at lossy images? Even if
> >>> it means waiting 6 years for "Nielsen's Law" to make up the
> >>> difference in network bandwidth?
> >>>
> >>> -James Holton
> >>> MAD Scientist
> >>>
> >>> On Mon, Nov 7, 2011 at 10:01 AM, Herbert J. Bernstein
> >>> <[email protected]> wrote:
> >>>> This is a very good question. I would suggest that both versions
> >>>> of the old data are useful. If what is being done is simple
> >>>> validation and regeneration of what was done before, then the
> >>>> lossy compression should be fine in most instances. However, when
> >>>> what is being done hinges on the really fine details -- looking
> >>>> for lost faint spots just peeking out from the background, looking
> >>>> at detailed peak profiles -- then the lossless compression version
> >>>> is the better choice. The annotation for both sets should be the
> >>>> same. The difference is in storage and network bandwidth.
> >>>>
> >>>> Hopefully the fraud issue will never again rear its ugly head,
> >>>> but if it should, then having saved the losslessly compressed
> >>>> images might prove to have been a good idea.
> >>>>
> >>>> To facilitate experimentation with the idea, if there is agreement
> >>>> on the particular lossy compression to be used, I would be happy
> >>>> to add it as an option in CBFlib. Right now all the compressions
> >>>> we have are lossless.
> >>>>
> >>>> Regards,
> >>>> Herbert
> >>>>
> >>>> =====================================================
> >>>> Herbert J. Bernstein, Professor of Computer Science
> >>>> Dowling College, Kramer Science Center, KSC 121
> >>>> Idle Hour Blvd, Oakdale, NY, 11769
> >>>>
> >>>> +1-631-244-3035
> >>>> [email protected]
> >>>> =====================================================
> >>>>
> >>>> On Mon, 7 Nov 2011, James Holton wrote:
> >>>>
> >>>>> At the risk of sounding like another "poll", I have a pragmatic
> >>>>> question for the methods development community:
> >>>>>
> >>>>> Hypothetically, assume that there was a website where you could
> >>>>> download the original diffraction images corresponding to any
> >>>>> given PDB file, including "early" datasets that were from the same
> >>>>> project but, because of smeary spots or whatever, couldn't be
> >>>>> solved. There might even be datasets with "unknown" PDB IDs
> >>>>> because that particular project never did work out, or because the
> >>>>> relevant protein sequence has been lost. Remember, few of these
> >>>>> datasets will be less than 5 years old if we try to allow enough
> >>>>> time for the original data collector to either solve it or
> >>>>> graduate (and then cease to care). Even for the "final" dataset
> >>>>> there will be a delay, since the half-life between data collection
> >>>>> and coordinate deposition in the PDB is still ~20 months. Plenty
> >>>>> of time to forget. So, although the images were archived (probably
> >>>>> named "test" and in a directory called "john"), it may be that the
> >>>>> only way to figure out which PDB ID is the "right answer" is by
> >>>>> processing them and comparing to all deposited Fs. Assume this was
> >>>>> done. But there will always be some datasets that don't match any
> >>>>> PDB. Are those interesting? What about ones that can't be
> >>>>> processed? What about ones that can't even be indexed? There may
> >>>>> be a lot of those!
> >>>>> (hypothetically, of course).
> >>>>>
> >>>>> Anyway, assume that someone did go through all the trouble to make
> >>>>> these datasets "available" for download, just in case they are
> >>>>> interesting, and annotated them as much as possible. There will be
> >>>>> about 20 datasets for any given PDB ID.
> >>>>>
> >>>>> Now assume that for each of these datasets this hypothetical
> >>>>> website has two links: one for the "raw data", which will average
> >>>>> ~2 GB per wedge (after gzip compression, taking at least ~45 min
> >>>>> to download), and a second link for a "lossy compressed" version,
> >>>>> which is only ~100 MB/wedge (2 min download). When decompressed,
> >>>>> the images will visually look pretty much like the originals, and
> >>>>> generally give you very similar Rmerge, Rcryst, Rfree, I/sigma,
> >>>>> anomalous differences, and all other statistics when processed
> >>>>> with contemporary software. Perhaps a bit worse. Essentially,
> >>>>> lossy compression is equivalent to adding noise to the images.
> >>>>>
> >>>>> Which one would you try first? Does lossy compression make it
> >>>>> easier to hunt for "interesting" datasets? Or is it just too
> >>>>> repugnant to have "modified" the data in any way, shape or form
> >>>>> ... after the detector manufacturer's software has "corrected" it?
> >>>>> Would it suffice to simply supply a couple of "example" images for
> >>>>> download instead?
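The claim that "lossy compression is equivalent to adding noise" can be made semi-quantitative: as long as the RMS error introduced by the codec is small compared with the photon shot noise (roughly sqrt(counts) for Poisson-distributed data), the derived statistics should barely move. A toy Python sketch, using crude uniform quantization as a stand-in for a real lossy codec (x264 or FIASCO behave differently in detail):

```python
# Toy demonstration that a lossy step adding error well below the shot
# noise is statistically almost invisible. Uniform quantization stands in
# for a real lossy codec here; it is purely illustrative.
import numpy as np

def quantize(img, step):
    """Lossy 'compression': round pixel values to multiples of `step`."""
    return np.round(img / step) * step

rng = np.random.default_rng(42)
truth = np.full((256, 256), 400.0)        # mean photon count per pixel
img = rng.poisson(truth).astype(float)    # shot-noise-limited "raw" image

shot_noise = np.sqrt(truth.mean())        # ~20 counts RMS from Poisson stats
codec_err = quantize(img, 8) - img        # error added by the lossy step

# Quantization with step q adds roughly q/sqrt(12) RMS error (~2.3 counts
# here), which is negligible next to ~20 counts of shot noise.
print(codec_err.std(), shot_noise)
```

In this regime the "noise" added by compression is an order of magnitude below the noise already in the measurement, which is why the Rmerge/Rfree/I-over-sigma statistics come out "perhaps a bit worse" rather than visibly damaged.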
> >>>>>
> >>>>> -James Holton
> >>>>> MAD Scientist
>
> --
> Miguel
>
> Architecture et Fonction des Macromolécules Biologiques (UMR6098)
> CNRS, Universités d'Aix-Marseille I & II
> Case 932, 163 Avenue de Luminy, 13288 Marseille cedex 9, France
> Tel: +33(0) 491 82 55 93
> Fax: +33(0) 491 26 67 20
> mailto:[email protected]
> http://www.afmb.univ-mrs.fr/Miguel-Ortiz-Lombardia

--
Jan Dohnalek, Ph.D.
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic
Tel: +420 296 809 390
Fax: +420 296 809 410
