So far, all I really have is a "proof of concept" compression algorithm here:
http://bl831.als.lbl.gov/~jamesh/lossy_compression/

Not exactly "portable" since you need ffmpeg and the x264 libraries
set up properly.  The latter seems to be constantly changing things
and breaking the former, so I'm not sure how "future proof" my
"algorithm" is.

Something that caught my eye recently was fractal compression,
particularly since FIASCO has been part of the NetPBM package for
about 10 years now.  To my eye it gives a compression-vs-quality
trade-off comparable to x264, but I'm now wondering whether I'd be
wasting my time developing this further.  Will the crystallographic
world simply
turn up its collective nose at lossy images?  Even if it means waiting
6 years for "Nielsen's Law" to make up the difference in network
bandwidth?
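
For anyone who wants to experiment, the NetPBM programs pnmtofiasco and
fiascotopnm do the round trip.  A rough sketch along the same lines as
above (the option names are from memory, so check your local NetPBM man
pages, and the frame_0001.* filenames are just placeholders for 8-bit
PGM conversions of the real images):

    import subprocess

    # Fractal-compress one frame with FIASCO.  Quality runs roughly
    # 1-100; higher keeps more detail at the cost of a larger .wfa file.
    subprocess.run(["pnmtofiasco", "--quality", "80",
                    "--output", "frame_0001.wfa", "frame_0001.pgm"],
                   check=True)

    # Reconstruct an approximation of the original frame (fiascotopnm
    # writes the PNM image to standard output by default).
    with open("frame_0001_recovered.pgm", "wb") as out:
        subprocess.run(["fiascotopnm", "frame_0001.wfa"],
                       stdout=out, check=True)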

-James Holton
MAD Scientist

On Mon, Nov 7, 2011 at 10:01 AM, Herbert J. Bernstein
<y...@bernstein-plus-sons.com> wrote:
> This is a very good question.  I would suggest that both versions
> of the old data are useful.  If what is being done is simple validation
> and regeneration of what was done before, then the lossy compression
> should be fine in most instances.  However, when what is being
> done hinges on the really fine details -- looking for lost faint
> spots just peeking out from the background, looking at detailed
> peak profiles -- then the lossless compression version is the
> better choice.  The annotation for both sets should be the same.
> The difference is in storage and network bandwidth.
>
> Hopefully the fraud issue will never again rear its ugly head,
> but if it should, then having saved the losslessly compressed
> images might prove to have been a good idea.
>
> To facilitate experimentation with the idea, if there is agreement
> on the particular lossy compression to be used, I would be happy
> to add it as an option in CBFlib.  Right now all the compressions
> we have are lossless.
>
> Regards,
>  Herbert
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 y...@dowling.edu
> =====================================================
>
> On Mon, 7 Nov 2011, James Holton wrote:
>
>> At the risk of sounding like another "poll", I have a pragmatic question
>> for the methods development community:
>>
>> Hypothetically, assume that there was a website where you could download
>> the original diffraction images corresponding to any given PDB file,
>> including "early" datasets that were from the same project, but because of
>> smeary spots or whatever, couldn't be solved.  There might even be datasets
>> with "unknown" PDB IDs because that particular project never did work out,
>> or because the relevant protein sequence has been lost.  Remember, few of
>> these datasets will be less than 5 years old if we try to allow enough time
>> for the original data collector to either solve it or graduate (and then
>> cease to care).  Even for the "final" dataset, there will be a delay, since
>> the half-life between data collection and coordinate deposition in the PDB
>> is still ~20 months. Plenty of time to forget.  So, although the images were
>> archived (probably named "test" and in a directory called "john") it may be
>> that the only way to figure out which PDB ID is the "right answer" is by
>> processing them and comparing to all deposited Fs.  Assume this was done.
>>  But there will always be some datasets that don't match any PDB.  Are those
>> interesting?  What about ones that can't be processed?  What about ones that
>> can't even be indexed?  There may be a lot of those!  (hypothetically, of
>> course).
>>
>> Anyway, assume that someone did go through all the trouble to make these
>> datasets "available" for download, just in case they are interesting, and
>> annotated them as much as possible.  There will be about 20 datasets for any
>> given PDB ID.
>>
>> Now assume that for each of these datasets this hypothetical website has
>> two links, one for the "raw data", which will average ~2 GB per wedge (after
>> gzip compression, taking at least ~45 min to download), and a second link
>> for a "lossy compressed" version, which is only ~100 MB/wedge (2 min
>> download). When decompressed, the images will visually look pretty much like
>> the originals, and generally give you very similar Rmerge, Rcryst, Rfree,
>> I/sigma, anomalous differences, and all other statistics when processed with
>> contemporary software.  Perhaps a bit worse.  Essentially, lossy compression
>> is equivalent to adding noise to the images.
>>
>> Which one would you try first?  Does lossy compression make it easier to
>> hunt for "interesting" datasets?  Or is it just too repugnant to have
>> "modified" the data in any way shape or form ... after the detector
>> manufacturer's software has "corrected" it?  Would it suffice to simply
>> supply a couple of "example" images for download instead?
>>
>> -James Holton
>> MAD Scientist
>>
>
