Tobias -

Simply put, this is an excellent post and discussion.  One of our foci has
been to allow open data exchange, much along the lines of Creative Commons,
for the spectral data that we are building up, and we are engaged in
efforts to work through the legal issues surrounding ownership of the
spectral data.

If you are interested in learning more about our efforts, please send an
email to [EMAIL PROTECTED]

Sanford

On 6/1/07, Tobias Kind <[EMAIL PROTECTED]> wrote:

I think as open data advocates we should rethink our strategies towards
open data spectral collections, including NMR, MS, IR and crystal
structure data or chemical property data. I will post this to the Blue
Obelisk mailing list to obtain more input. There were some interesting
discussions in the BlogOSphere, but I am still stuck in Web 1.0, so I need
to post it to the BO mailing list. I will later put this comment on the new
BlueObelisk wiki and collect discussions (via votes or comments;
technology?), and we can compile a list of chemistry journals on
BlueObelisk with comments on their data sharing policy for chemistry data.
Additionally, editors and editorial boards will be contacted. Chemistry
journal lists:

http://www.cas.org/expertise/cascontent/caplus/corejournals.html
http://www.nlm.nih.gov/bsd/journals/subjects.html

Some recent BlogLinks (2007):

http://www.sennoma.net/main/archives/2006/12/where_are_the_data_can_i_have.php

http://researchremix.wordpress.com/2007/05/30/diverse-journal-requirements-for-data-sharing/
http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=350

****************************************************************
The paradigm shift in data sharing in chemistry and the Blue Obelisk movement

Tobias Kind - fiehnlab.ucdavis.edu

There are historic reasons why chemists depend on commercial databases and
commercial spectral collections. These sources are usually very reliable,
curated resources which are either subscription based or one-time-fee
based. However, chemists still publish in the same manner as 100 years ago.
The obstacles which culminate in a hindrance of science (think of it as the
Bible being published only in Latin) are:

O1) Spectral databases or experimental molecular property data are
license-protected in such a way that derivative work on large data sets
cannot easily be done. This is the case if only one spectrum at a time can
be investigated from a collection of one million spectra (instead of bulk
access), or if bulk access is forbidden by the EULA or license.

O2) Subscription prices for data collections are too high (I am not talking
about hundreds of dollars, but tens of thousands of dollars for a modern
spectral or cheminformatics lab).

O3) A combination of both of the above reasons, which ultimately leads to a
deadlock in new research synergies even for well-equipped academic labs,
and especially for smaller companies.

O4) Many scattered and incomplete data collections exist instead of a
complete collection of spectral or molecular property data. Hundreds of
labs have their own little private collections.

O5) If metadata such as toxicity data, spectral properties etc. are not
published in an appropriate way or simply get lost, then this is a waste of
resources, money and time, and in the case of toxicity tests involving
animals it is even a severe ethical issue. This data loss is additionally a
double-pay feature, because chemists have to repeat the experiments again
and again (let's say for NMR spectra), and after two or three people have
confirmed a spectrum, such copyrighted spectra are collected and sold back
to the chemists. Thousands of man-years of research are wasted down the
drain.

In the case of metadata such as molecular spectra or molecular properties,
the target for change should not be the commercial publishers but the
scientists themselves. One assumption is that publishing in an OA journal
does not mean that spectral metadata in CML format is automatically
included. Another assumption is that there will still be a mix of open
access and commercial publishers in the near future.

A1) The power to change publishing behavior lies in the hands of Editorial
Boards and Editors. These are usually honorary or experienced scientists in
their field. If they can be convinced that supporting spectral data and
chemical property data as CML or XML is valuable to "their" journal or to
science in general, they have the power to change that.

A2) Additional power lies in the hands of reviewers, who can gradually
request that as much spectral data and chemical property data as possible
be submitted electronically to an open access repository with every
publication. They could also reject a submission if no such minimal data is
delivered.

A3) Some power lies in the hands of chemists themselves, who can simply
submit spectral or property data as a CML or XML supplement with every
publication (this requires a change of mindset and currently means some
more work).

To solve these problems there are some requirements.

R1) Software tools for easy extraction and submission of spectral data or
experimental molecular property data should exist. For spectral data this
can be either free software (such as the existing BioEclipse) or any new
commercial software.

R2) The problem of linking molecular structures to molecular data and to
publications is not yet solved. This is a very chemistry-specific problem.
It includes the InChI codes and PubChem IDs as unique identifiers for
molecular structures (with many unresolved problems), their connection to
the properties and spectra, and the linking to the publication via the DOI
number. See also http://sourceforge.net/projects/spectra-chem

R3) Open access and commercial publishers should be directly involved in
such a process, because the metadata should be linked to the publication
itself via the DOI number, and the metadata should in turn carry a DOI
pointer or any other link to the publication.

R4) The data structures (how data is sent to a database, definitions of CML
or XML files) must follow minimum standards. The submission process should
start immediately, because the definition of data standards can take
decades. Common sense would be a good starting point. Existing exchange
formats can be used directly (JCAMP, netCDF, CML). For example, in the case
of mass spectra, the minimum would be the name, InChI, PubChem ID, DOI,
formula, MW, m/z values and intensities.
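
As a rough illustration of the minimal mass spectrum record listed in R4,
here is a small Python sketch. The class name MassSpectrumRecord, its field
names and the JSON layout are assumptions made up for this example; they
are not taken from CML, JCAMP, netCDF or any existing submission system.
The DOI and the peak list are placeholders, not measured data.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class MassSpectrumRecord:
    """Hypothetical minimal record; field names are assumptions, not a standard."""
    name: str
    inchi: str
    pubchem_id: int
    doi: str
    formula: str
    molecular_weight: float
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs

    def validate(self) -> None:
        """Common-sense checks only; raises ValueError on the first problem."""
        if not self.inchi.startswith("InChI="):
            raise ValueError("InChI string must start with 'InChI='")
        if not self.doi.startswith("10."):
            raise ValueError("DOI must start with '10.'")
        if not self.peaks:
            raise ValueError("at least one (m/z, intensity) pair is required")
        for mz, intensity in self.peaks:
            if mz <= 0 or intensity < 0:
                raise ValueError(f"implausible peak: {mz}, {intensity}")

    def to_json(self) -> str:
        """Validate the record, then serialize it as machine-readable JSON."""
        self.validate()
        return json.dumps(asdict(self), indent=2)


def example_record() -> MassSpectrumRecord:
    # Placeholder DOI and peak values for illustration only; not measured data.
    return MassSpectrumRecord(
        name="methane",
        inchi="InChI=1S/CH4/h1H4",
        pubchem_id=297,
        doi="10.0000/placeholder",
        formula="CH4",
        molecular_weight=16.04,
        peaks=[(15.0, 85.0), (16.0, 100.0)],
    )
```

Calling example_record().to_json() gives a text record that links the
structure (InChI, PubChem ID), the spectrum (peaks) and the publication
(DOI) in one place. The same fields could just as well be written to JCAMP
or CML; JSON is used here only to keep the sketch short.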

Solutions must include academia, commercial publishers, and commercial
database or software providers. The transformation to open chemistry data
collections will come without question, and it should be seen as an
opportunity for new services. Many commercial cheminformatics companies
operate at the forefront of technology. So instead of copying data out of
paper or PDF journals (a job which can be done by computers), they could
free their workforce from this boring task and let them work on true
innovations.

S1) The collection and submission of data with every publication must start
now. Think of it like the eternal beta state in Web 2.0. This must be
triggered by a paradigm shift (revolution) or a petition (slow) or an
organization like Blue Obelisk (small but growing).

S2) Targets must be Editorial Boards and Editors, and later reviewers, of
the most innovative OA and commercial chemistry, biochemistry and
chemoinformatics journals. They must be convinced that open data
supplements are good for science. There should be a requirement to supply
such data with every publication.

S3) The metadata must be submitted as an openly accessible (OA) supplement
to the journal or to an open-data collector such as NMRShiftDB or SPECTRa
or RedHen Spectra or CrystalEye. The publication itself can still be
copyrighted if needed. The problem is that currently only the NMRShiftDB is
in a complete working state. A good solution would be one global open data
collector for chemistry (for example, hosted on SourceForge) instead of
many specialized solutions.

S4) Data format dogmatism should be kept out of it; for molecular property
data even Excel XML or OpenOffice XML data or SQL dumps should be allowed.
For spectral data only exchange formats like JCAMP, netCDF, CML or XML
should be used. Supporting information as PDF or JPG for data collections
should be forbidden, because of the multiple problems in converting it back
to machine-readable data.

S5) The spectral data, molecular property data and molecular structures
must be published under an open data license which allows commercial and
non-commercial reuse and redistribution (like Creative Commons Attribution,
CC-BY). Commercial reuse is important because data curation still costs
money. New innovative chemistry software or databases would rely on such
large open data collections. Open science can take these data collections
and provide basic services, and hence push both science itself and
commercial operations forward in innovation.



_______________________________________________
Blue-obelisk mailing list
Blue-obelisk@hardly.cubic.uni-koeln.de
http://hardly.cubic.uni-koeln.de/mailman/listinfo/blue-obelisk