Tobias - Simply put, this is an excellent post and discussion. One of our foci has been to allow open data exchange, much along the lines of Creative Commons, for the spectral data that we are building up, and we are working on the legal issues surrounding ownership of that spectral data.
If you are interested in learning more about our efforts, please send an email to [EMAIL PROTECTED]

Sanford

On 6/1/07, Tobias Kind <[EMAIL PROTECTED]> wrote:
I think as open data advocates we should re-think our strategies towards open data spectral collections, including NMR, MS, IR and crystal structure data as well as chemical property data. I will post this to the Blue Obelisk mailing list to obtain more input. There were some interesting discussions in the BlogOSphere, but I am still stuck in Web 1.0, so I need to post it to the BO mailing list. Later I will put this comment on the new BlueObelisk wiki and collect discussions (via votes or comments; technology?), and we can compile a list of chemistry journals on BlueObelisk with comments on their data sharing policy for chemistry data. Additionally, editors and the editorial boards will be contacted.

Chemistry journal list:
http://www.cas.org/expertise/cascontent/caplus/corejournals.html
http://www.nlm.nih.gov/bsd/journals/subjects.html

Some recent BlogLinks (2007):
http://www.sennoma.net/main/archives/2006/12/where_are_the_data_can_i_have.php
http://researchremix.wordpress.com/2007/05/30/diverse-journal-requirements-for-data-sharing/
http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=350

****************************************************************

The paradigm shift in data sharing in chemistry and the Blue Obelisk movement
Tobias Kind - fiehnlab.ucdavis.edu

There are some historic reasons that chemists depend on commercial databases and commercial spectral collections. These sources are usually very reliable, curated resources which are either subscription based or one-time-fee based. However, chemists still publish in the same manner as 100 years ago.

Obstacles which culminate in hindrance of science (literally think of it as the Bible only published in Latin) are:

O1) Spectral databases or experimental molecular property data are license protected in such a way that no derivative work on large data sets can easily be obtained.
This would be the case if, from a collection of one million spectra, only one spectrum at a time can be investigated (instead of bulk access), or if such an approach is forbidden in the EULA or license.
O2) Subscription prices for data collections are too high (I am not talking about hundreds of dollars, but tens of thousands of dollars for a modern spectral or cheminformatics lab).
O3) A combination of both reasons above, which ultimately leads to a deadlock in new research synergies, even for well-equipped academic labs and especially for smaller companies.
O4) Many scattered and incomplete data collections exist instead of one complete collection of spectral or molecular property data. Hundreds of labs have their own little private collections.
O5) If meta-data such as toxicity data, spectral properties etc. are not published in the appropriate way or just get lost, then this is a waste of resources, money and time, and in the case of toxicity tests where animals are involved it is even a severe ethical issue.

This data loss is additionally a double-pay feature, because chemists have to repeat the experiments again and again (let's say for NMR spectra), and after two or three people have confirmed a spectrum, such copyrighted spectra are collected and sold back to the chemists. Thousands of man-years of research are wasted down the drain.

In the case of meta-data such as molecular spectra or molecular properties, the target for change should not be the commercial publisher but the scientists themselves. One assumption is that publishing in an OA journal does not mean that spectral metadata in CML format is automatically included. Another assumption is that there will still be a mix of open access and commercial publishers in the near future.

A1) The power to change publishing behavior lies in the hands of Editorial Boards and Editors. These are usually honorary or experienced scientists in their field.
If they can be convinced that supporting spectral data and chemical property data as CML or XML is valuable to "their" journal or to science in general, they have the power to change that.
A2) Additional power lies in the hands of reviewers, who can gradually request that as much spectral data and chemical property data as possible is submitted electronically to an open access repository with every publication. They could also refuse a submission if no such minimal data is delivered.
A3) Some power lies in the hands of chemists themselves, by simply submitting spectral or property data as a CML or XML supplement with every publication (this requires a change of mindset and currently means some more work).

To solve these problems there are some requirements.

R1) Software tools for easy extraction and submission of spectral data or experimental molecular property data should exist. For spectral data this can be either free software (such as the existing BioEclipse) or any new commercial software.
R2) The problem of linking molecular structures to molecular data to publications is not yet solved. This is a very chemistry-specific problem. It includes the InChI codes and PubChem IDs as unique identifiers for molecular structures (with many unresolved problems), their connection to the properties and spectra, and the linking to the publication via the DOI number. See also http://sourceforge.net/projects/spectra-chem
R3) Open access and commercial publishers should be directly involved in such a process, because the meta-data should be linked to the publication itself via the DOI number, and the meta-data should have a DOI pointer or any other link to the publication.
R4) The data structures (how data is sent to a database, definitions of CML or XML files) must follow minimum standards. The submission process should start immediately, because the definition of data standards can take decades. Common sense would be a good starting point.
Existing exchange formats can be used directly (JCAMP, netCDF, CML). For example, in the case of mass spectra, this would be the name, InChI, PubChem ID, DOI, formula, MW, m/z values and intensities.

Solutions must include academia and commercial publishers and commercial databases or software providers. The transformation to open chemistry data collections will come without question, but this should be considered a chance or opportunity for new services. Many commercial cheminformatics companies operate at the forefront of technology. So instead of copying data out of paper or PDF journals (a job which can be done by computers), they could free their workforce from this boring task and let them work on truly new innovations.

S1) The collection and submission of data with every publication must start now. Think of it like the eternal beta state in Web 2.0. This must be triggered by a paradigm shift (revolution) or a petition (slow) or an organization like Blue Obelisk (small but growing).
S2) Targets must be Editorial Boards and Editors, and later reviewers, of the most innovative OA and commercial chemistry, biochemistry and chemoinformatics journals. They must be convinced that open data supplements are good for science. There should be a requirement to supply such data with every publication.
S3) The meta-data must be submitted as an openly accessible (OA) supplement to the journal or to an open-data collector such as NMRShiftDB or SPECTRa or RedHen Spectra or CrystalEye. The publication itself can still be copyrighted if needed. The problem is that currently only the NMRShiftDB is in a complete working state. A good solution would be one global open data collector for chemistry (for example hosted on SourceForge) instead of many specialized solutions.
S4) Data format dogmatism should be kept out; for molecular property data even EXCEL XML or Open Office XML data or SQL dumps should be allowed.
For spectral data only exchange formats like JCAMP, netCDF, CML or XML should be used. Supporting information as PDF or JPG for data collections should be forbidden, because of the multiple problems in converting it back to machine-readable data.
S5) The spectral data, molecular property data and molecular structures must be published under an open data license which allows commercial and non-commercial reuse and redistribution (like Creative Commons Attribution, CC-BY). Commercial reuse is important because data curation still costs money. New innovative chemistry software or databases would rely on such large open data collections. Open science can take these data collections and provide basic services, and hence push science itself and also commercial operations forward in innovation.
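As a rough illustration of the minimal mass-spectrum record suggested under R4 (name, InChI, PubChem ID, DOI, formula, MW, m/z values and intensities), here is a small Python sketch. The class name, field names and XML element names are hypothetical and are not CML or any existing schema; the DOI and the peak values are placeholders, not measured data. It only tries to show how little is needed to tie a spectrum to a structure and to a publication.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class MassSpectrumRecord:
    """Hypothetical minimal record; the field list follows the R4 example."""
    name: str
    inchi: str
    pubchem_cid: int
    doi: str
    formula: str
    molecular_weight: float
    peaks: list = field(default_factory=list)  # (m/z, intensity) pairs

    def missing_fields(self):
        """Return the names of required text fields that are empty."""
        required = ("name", "inchi", "doi", "formula")
        missing = [f for f in required if not getattr(self, f)]
        if not self.peaks:
            missing.append("peaks")
        return missing

    def to_xml(self):
        """Serialize to an ad-hoc XML layout (an assumption, not CML)."""
        root = ET.Element("massSpectrum")
        for tag, value in (
            ("name", self.name),
            ("inchi", self.inchi),
            ("pubchemCid", self.pubchem_cid),
            ("doi", self.doi),
            ("formula", self.formula),
            ("molecularWeight", self.molecular_weight),
        ):
            ET.SubElement(root, tag).text = str(value)
        peaks_el = ET.SubElement(root, "peaks")
        for mz, intensity in self.peaks:
            ET.SubElement(peaks_el, "peak", mz=str(mz), intensity=str(intensity))
        return ET.tostring(root, encoding="unicode")


# Example record for methane. The DOI and the peak list are placeholders.
record = MassSpectrumRecord(
    name="methane",
    inchi="InChI=1S/CH4/h1H4",
    pubchem_cid=297,
    doi="10.0000/placeholder",
    formula="CH4",
    molecular_weight=16.04,
    peaks=[(16.0, 100.0), (15.0, 85.0)],  # illustrative values only
)
xml_text = record.to_xml()
print(xml_text)
```

A repository or reviewer could run a check like missing_fields() before accepting a supplement. Converting such a record to JCAMP or CML would be a separate step and is not shown here.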
_______________________________________________
Blue-obelisk mailing list
Blue-obelisk@hardly.cubic.uni-koeln.de
http://hardly.cubic.uni-koeln.de/mailman/listinfo/blue-obelisk