The point is that science is not stamp collecting. Therefore the first question should always be "Why". If you start with "What", the discussion immediately switches to technical issues like how many TB or PB, how many $/EUR, how much manpower. And all that intense discussion can be blown away by one single "Why". Nothing is for free. But if it helped science and mankind, nobody would hesitate to spend millions of $/EUR.

Supporting software development / software developers is a different question. If that had been the first question asked, the answer would never have been "archive all datasets worldwide / all deposited structures", but rather: how could we, the community, build up a resource covering different kinds of problems (e.g. space groups, twinning, overlapping lattices, etc.)?

I still haven't got an answer to "Why".

Best regards,
Martin



On 31.10.2011 16:18, Oganesyan, Vaheh wrote:
I have been hesitant to add my opinion so far because I'm more used to listening to this forum than to telling others what I think. "Why" and "what" to deposit are absolutely interconnected. Once you decide why you want to do it, you will probably know what the best format will be, and vice versa. Whether depositing raw images will help us understand the biology better in the future, I'm not sure. But storing those difficult datasets to help future software development sounds really far-fetched. This assumes that in the future crystallographers will never grow crystals that deliver difficult datasets. If that is the case, and in 10, 20 or 30 years the next generation is growing much better crystals, then they won't need such software development. If that is not the case, and once in a while (or more often) they get something out of the ordinary, then software developers will take those datasets and develop whatever they need to handle such cases.
Am I missing the point of the discussion here?
Regards,
     Vaheh
-----Original Message-----
From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Robert Esnouf
Sent: Monday, October 31, 2011 10:31 AM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
Dear All,
As someone who recently left crystallography for sequencing, I
should modify Tassos's point...
"A full data-set is a few terabytes, but post-processing
reduces it to sub-Gb size."
My experience from HiSeqs is that this "full" here means the
base calls - equivalent to the unmerged HKLs - hardly raw
data. NGS (short-read) sequencing is an imaging technique and
the images are more like >100TB for a 15-day run on a single
flow cell. The raw base calls are about 5TB. The compressed,
mapped data (BAM file, for a human genome, 30x coverage) is
about 120GB. It is only a variant call file (VCF, difference
from a stated human reference genome) that is sub-Gb and these
files are - unsurprisingly - unsuited to detailed statistical
analysis. Also, $1k is not yet an economic cost...
The DNA information capacity in a single human body dwarfs the
entire world disk capacity, so storing DNA is a no-brainer
here. Sequencing groups are making very hard-nosed economic
decisions about what to store - indeed it is a source of
research in itself - but the scale of the problem is very much
bigger.
My tuppence ha'penny is that depositing "raw" images along
with everything else in the PDB is a nice idea but would have
little impact on science (human/animal/plant health or
understanding of biology).
1) If confined to structures in the PDB, the images would just
be the ones giving the final best data - hence the ones least
likely to have been problematic. I'd be more interested in
SFs/maps for looking at ligand-binding etc...
2) Unless this were done before paper acceptance they would be
of little use to referees seeking to review important
structural papers. I'd like to see PDB validation reports
(which could include automated data processing, perhaps culled
from synchrotron sites, SFs and/or maps) made available to
referees in advance of publication. This would be enabled by
deposition, but could be achieved in other ways.
3) The datasets of interest to methods developers are unlikely
to be the ones deposited. They should be in contact with
synchrotron archives directly. Processing multiple lattices is
a case in point here.
4) Remember the "average consumer" of a PDB file is not a
crystallographer. More likely to be a graduate student in a
clinical lab. For him/her things like occupancies and B-
factors are far more serious concerns... I'm not trivializing
the issue, but importance is always relative. Are there
"outsiders" on the panel to keep perspective?
Robert
--
Dr. Robert Esnouf,
University Research Lecturer, ex-crystallographer
and Head of Research Computing,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive, Oxford OX3 7BN, UK
Emails: rob...@strubi.ox.ac.uk   Tel: (+44) - 1865 - 287783
    and rob...@esnouf.com        Fax: (+44) - 1865 - 287547
---- Original message ----
>Date: Mon, 31 Oct 2011 11:37:47 +0100
>From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> (on behalf of Anastassis Perrakis <a.perra...@nki.nl>)
>Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
>To: CCP4BB@JISCMAIL.AC.UK
>
>   Dear all,
>   The discussion about keeping primary data, and what
>   level of data can be considered 'primary', has -
>   rather unsurprisingly - come up also in areas other
>   than structural biology.
>   An example is next generation sequencing. A
>   full-dataset is a few terabytes, but
>   post-processing reduces it to sub-Gb size. However,
>   the post-processed data, as in our case,
>   have suffered the inadequacy of computational
>   "reduction" ... At least our institute has decided
>   to create double back-up of the primary data in
>   triplicate. For that reason our facility bought
>   three -80 freezers, one on site in the basement, one
>   on the top floor, and one off-site, and they keep
>   the DNA to be sequenced. A sequencing run is already
>   sub-1k$ and it will not become
>   more expensive. So, if it's important, do it again.
>   It's cheaper and it's better.
>   At first sight, that does not apply to MX. Or does
>   it?
>   So, maybe the question is not "To archive or not to
>   archive" but "What to archive".
>   (similarly, it never crossed my mind if I should "be
>   or not be" - I always wondered "what to be")
>   A.
>   On Oct 30, 2011, at 11:59, Kay Diederichs wrote:
>
>     At 20:59, Jrh wrote:
>     ...
>
>       So:-  Universities are now establishing their own
>       institutional repositories, driven largely by Open
>       Access demands of funders. For these to host raw
>       datasets that underpin publications is a reasonable
>       role in my view and indeed they already have this
>       category in the University of Manchester eScholar
>       system, for example.  I am set to explore locally
>       here whether they would accommodate all our Lab's
>       raw X-ray image datasets per annum that underpin our
>       published crystal structures.
>
>       It would be helpful if readers of this CCP4bb could
>       kindly also explore with their own universities if
>       they have such an institutional repository and if
>       raw data sets could be accommodated. Please do email
>       me off list with this information if you prefer but
>       within the CCP4bb is also good.
>
>     Dear John,
>
>     I'm pretty sure that there exists no consistent
>     policy to provide an "institutional repository"
>     for deposition of scientific data at German
>     universities or Max-Planck institutes or Helmholtz
>     institutions, at least I never heard of something
>     like this. More specifically, our University of
>     Konstanz certainly does not have the
>     infrastructure to provide this.
>
>     I don't think that Germany is the only country
>     which is the exception to any rule of availability
>     of "institutional repository". Rather, I'm almost
>     amazed that British and American institutions seem
>     to support this.
>
>     Thus I suggest not focusing exclusively on
>     official institutional repositories, but
>     exploring alternatives: distributed filestores like
>     Google's BigTable, BitTorrent or others might be
>     just as suitable - check out
> http://en.wikipedia.org/wiki/Distributed_data_store.
>     I guess that any crystallographic lab could easily
>     sacrifice/donate a TB of storage for the purposes
>     of this project in 2011 (and maybe 2 TB in 2012, 3
>     in 2013, ...), but clearly the level of work to
>     set this up should be kept as low as possible (a
>     bittorrent daemon seems simple enough).
>
>     Just my 2 cents,
>
>     Kay
>
>   Anastassis (Tassos) Perrakis, Principal Investigator
>   / Staff Member
>   Department of Biochemistry (B8)
>   Netherlands Cancer Institute,
>   Dept. B8, 1066 CX Amsterdam, The Netherlands
>   Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile /
>   SMS: +31 6 28 597791