I have no doubt there are software developers out there who have spent
years building up their own personal collections of 'interesting' datasets,
file formats, and various oddities that they take with them wherever they
go, and consider this collection to be precious. Despite the fact that many
bad datasets are collected daily at beamlines the world over, it is amazing
how difficult it can be to find what you want when there is no open, single
point-of-access repository to search. Simply asking the crystallographers
and beamline scientists doesn't work: they are too busy doing their own
jobs.

-- David


On 31 October 2011 15:18, Oganesyan, Vaheh <oganesy...@medimmune.com> wrote:

>
> I was hesitant to add my opinion so far because I'm more used to listening
> to this forum than to telling others what I think.
> "Why" and "what" to deposit are absolutely interconnected. Once you decide
> why you want to do it, then you will probably know what will be the best
> format and *vice versa*.
> I'm not sure whether depositing raw images will help us understand the
> biology better in the future.
> But storing those difficult datasets to help future software
> development sounds really far-fetched. This assumes that in the future
> crystallographers will never grow crystals that deliver difficult
> datasets. If that is the case, and in 10, 20 or 30 years the next generation
> is growing much better crystals, then they won't need such software
> development.
> If that is not the case, and once in a while (or more often) they get
> something out of the ordinary, then software developers will take those
> datasets and develop whatever they need to handle such cases.
>
> Am I missing the point of the discussion here?
>
> Regards,
>
>       Vaheh
>
>
>
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK]
> On Behalf Of Robert Esnouf
> Sent: Monday, October 31, 2011 10:31 AM
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
>
> Dear All,
>
> As someone who recently left crystallography for sequencing, I
> should modify Tassos's point...
>
> "A full data-set is a few terabytes, but post-processing
> reduces it to sub-Gb size."
>
> My experience from HiSeqs is that this "full" here means the
> base calls - equivalent to the unmerged HKLs - hardly raw
> data. NGS (short-read) sequencing is an imaging technique and
> the images are more like >100TB for a 15-day run on a single
> flow cell. The raw base calls are about 5TB. The compressed,
> mapped data (BAM file, for a human genome, 30x coverage) is
> about 120GB. It is only a variant call file (VCF, difference
> from a stated human reference genome) that is sub-Gb and these
> files are - unsurprisingly - unsuited to detailed statistical
> analysis. Also $1k is not yet an economic cost...
>
> The DNA information capacity in a single human body dwarfs the
> entire world disk capacity, so storing DNA is a no-brainer
> here. Sequencing groups are making very hard-nosed economic
> decisions about what to store - indeed it is a source of
> research in itself - but the scale of the problem is very much
> bigger.
>
> My tuppence ha'penny is that depositing "raw" images along
> with everything else in the PDB is a nice idea but would have
> little impact on science (human/animal/plant health or
> understanding of biology).
>
> 1) If confined to structures in the PDB, the images would just
> be the ones giving the final best data - hence the ones least
> likely to have been problematic. I'd be more interested in
> SFs/maps for looking at ligand-binding etc...
>
> 2) Unless this were done before paper acceptance they would be
> of little use to referees seeking to review important
> structural papers. I'd like to see PDB validation reports
> (which could include automated data processing, perhaps culled
> from synchrotron sites, SFs and/or maps) made available to
> referees in advance of publication. This would be enabled by
> deposition, but could be achieved in other ways.
>
> 3) The datasets of interest to methods developers are unlikely
> to be the ones deposited. They should be in contact with
> synchrotron archives directly. Processing multiple lattices is
> a case in point here.
>
> 4) Remember the "average consumer" of a PDB file is not a
> crystallographer. More likely to be a graduate student in a
> clinical lab. For him/her things like occupancies and
> B-factors are far more serious concerns... I'm not trivializing
> the issue, but importance is always relative. Are there
> "outsiders" on the panel to keep perspective?
>
> Robert
>
>
> --
>
> Dr. Robert Esnouf,
> University Research Lecturer, ex-crystallographer
> and Head of Research Computing,
> Wellcome Trust Centre for Human Genetics,
> Roosevelt Drive, Oxford OX3 7BN, UK
>
> Emails: rob...@strubi.ox.ac.uk   Tel: (+44) - 1865 - 287783
>     and rob...@esnouf.com        Fax: (+44) - 1865 - 287547
>
>
> ---- Original message ----
> >Date: Mon, 31 Oct 2011 11:37:47 +0100
> >From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> (on behalf
> of Anastassis Perrakis <a.perra...@nki.nl>)
> >Subject: Re: [ccp4bb] To archive or not to archive, that's
> the question!
> >To: CCP4BB@JISCMAIL.AC.UK
> >
> >   Dear all,
> >   The discussion about keeping primary data, and what
> >   level of data can be considered 'primary', has -
> >   rather unsurprisingly - come up also in areas other
> >   than structural biology.
> >   An example is next generation sequencing. A
> >   full dataset is a few terabytes, but
> >   post-processing reduces it to sub-Gb size. However,
> >   the post-processed data, as in our case,
> >   have suffered the inadequacy of computational
> >   "reduction" ... At least our institute has decided
> >   to create double back-up of the primary data in
> >   triplicate. For that reason our facility bought
> >   three -80 freezers, one on site in the basement, one
> >   on the top floor, and one off-site, and they keep
> >   the DNA to be sequenced. A sequencing run is already
> >   sub-1k$ and it will not become
> >   more expensive. So, if it's important, do it again.
> >   It's cheaper and it's better.
> >   At first sight, that does not apply to MX. Or does
> >   it?
> >   So, maybe the question is not "To archive or not to
> >   archive" but "What to archive".
> >   (similarly, it never crossed my mind if I should "be
> >   or not be" - I always wondered "what to be")
> >   A.
> >   On Oct 30, 2011, at 11:59, Kay Diederichs wrote:
> >
> >     At 20:59, Jrh wrote:
> >     ...
> >
> >       So:-  Universities are now establishing their own
> >       institutional repositories, driven largely by Open Access
> >       demands of funders. For these to host raw datasets that
> >       underpin publications is a reasonable role in my view and
> >       indeed they already have this category in the University of
> >       Manchester eScholar system, for example.  I am set to explore
> >       locally here whether they would accommodate all our Lab's raw
> >       Xray images datasets per annum that underpin our published
> >       crystal structures.
> >
> >       It would be helpful if readers of this CCP4bb could kindly
> >       also explore with their own universities if they have such an
> >       institutional repository and if raw data sets could be
> >       accommodated. Please do email me off list with this
> >       information if you prefer but within the CCP4bb is also good.
> >
> >     Dear John,
> >
> >     I'm pretty sure that there exists no consistent
> >     policy to provide an "institutional repository"
> >     for deposition of scientific data at German
> >     universities or Max-Planck institutes or Helmholtz
> >     institutions; at least I have never heard of anything
> >     like this. More specifically, our University of
> >     Konstanz certainly does not have the
> >     infrastructure to provide this.
> >
> >     I don't think that Germany is the only country
> >     that is an exception to any rule that an
> >     "institutional repository" is available. Rather, I'm
> >     almost amazed that British and American institutions
> >     seem to support this.
> >
> >     Thus I suggest not focusing exclusively on
> >     official institutional repositories, but
> >     exploring alternatives: distributed filestores like
> >     Google's BigTable, BitTorrent or others might be
> >     just as suitable - check out
> >     http://en.wikipedia.org/wiki/Distributed_data_store.
> >     I guess that any crystallographic lab could easily
> >     sacrifice/donate a TB of storage for the purposes
> >     of this project in 2011 (and maybe 2 TB in 2012, 3
> >     in 2013, ...), but clearly the level of work to
> >     set this up should be kept as low as possible (a
> >     BitTorrent daemon seems simple enough).
> >
> >     Just my 2 cents,
> >
> >     Kay
> >
> >   Anastassis (Tassos) Perrakis, Principal Investigator
> >   / Staff Member
> >   Department of Biochemistry (B8)
> >   Netherlands Cancer Institute,
> >   Dept. B8, 1066 CX Amsterdam, The Netherlands
> >   Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile /
> >   SMS: +31 6 28 597791
>
