The discussion on InChIs raises the question of who creates and 
manages communal resources and metadata. InChIs work because they are 
algorithmic, but they fail for inorganics (especially mineral 
polymorphs) and for substances ("glucose", "glutamate", etc.), where 
conventional human-assigned identifiers are possible. If the 
substances are sufficiently common they will be in Wikipedia, which 
should work as an excellent mechanism, and if they are in PubChem 
that is also a possibility. But if they are new ("proposed molecule 
X") or appear only in a publication, then we need something else.

A similar problem arises with images for structures. There are 
several types, but specifically (a) semantic, often ugly but 
machine-compatible and (b) pretty - cf TotallySynth - often with 
unclear machine semantics (e.g. perspective). Both are needed - the 
MathML community also has this problem.

Wikipedia solves the image problem by providing a repository of 
images and allowing multiple link-throughs. Two years ago I thought 
about providing an image drawing service for blogs - draw once, use 
many. If, say, we had JChemPaint mounted on our server anyone could 
draw their image for the blog and link to it. The killer was that (a) 
we couldn't provide a robust server and (b) demand might kill it.

But now we have unlimited free storage everywhere. So what about:
(a) there are a number of molecule drawing sites. (Obviously we can't 
provide ChemDraw, but most others would allow it - Marvin, JME, ACD). 
The author would draw a structure and - for organics - get the 
InChI. The service would immediately search the BO server space for 
the identical InChI (or, excitingly, any InChI related by layers). If 
it found other InChIs it could display these to the author, who might 
wish to use one with, say, fuller stereochemistry. Or a prettier 
version (e.g. for macrocycles).  There might even be some clever 
language processing - e.g. paste a name from a journal and get the 
structure - Peter has been looking into that.
(b) the software generates an image and gives it a unique name. 
Probably not the InChI but either a PubChem CID or a BO ID (see below). 
Then we post it to Flickr or Google or wherever. These sites remain 
stable so that authors could link to them from their blog. It would 
depend on the blog software how easy it was to download images into 
the text - Wordpress seems to do local files but not URLs - unless I 
have missed something. But actually cut and paste from remote images 
seems to work in many cases.
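
To make the "related by layers" idea in (a) concrete, here is a rough 
sketch in Python. It assumes that two InChIs count as related when they 
share the formula, connectivity (c) and hydrogen (h) layers but differ 
elsewhere, e.g. in stereochemistry. That rule, the function names and 
the sample strings are illustrative assumptions, not an agreed BO 
convention, and the parser ignores subtleties such as repeated layers 
after /f or /i.

```python
SKELETON = ("formula", "c", "h")  # assumed definition of "same skeleton"

def inchi_layers(inchi):
    """Split an InChI string into version, formula and lettered layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Remaining layers start with a one-letter tag (c, h, t, m, s, ...).
            # Keep only the first occurrence of each tag (a simplification).
            layers.setdefault(part[0], part[1:])
    return layers

def related_by_layers(a, b):
    """True if a and b differ but agree on every skeleton layer."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return a != b and all(la.get(k) == lb.get(k) for k in SKELETON)

def find_related(query, known):
    """Return (exact matches, layer-related matches) for a drawn structure."""
    exact = [k for k in known if k == query]
    related = [k for k in known if related_by_layers(query, k)]
    return exact, related

# Sample strings for illustration: alanine with and without stereo, glycine.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ALA_FLAT = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
GLY = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"

if __name__ == "__main__":
    exact, related = find_related(ALA_FLAT, [L_ALA, D_ALA, ALA_FLAT, GLY])
    print("exact:", exact)
    print("related:", related)
```

An author who drew the flat structure would then be offered the two 
stereo-specific entries as alternatives.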

So I suggest that we might need a BO identifier. It needs to be 
nearly unique. (If it collides once every few years the blogosphere 
will forgive you). I suspect that an MD5 hash or a datetime should be 
OK, and this could be kept relatively short. Or we could simply ask the 
servers to assign ids sequentially and use new ones if they collide. 
We can't be the first to do this. We aren't running a bank or a 
nuclear power station, so an occasional problem won't matter.
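
As a rough illustration of the short-MD5 option, the sketch below keeps 
the first few hex characters of the MD5 of the InChI and, on a 
collision, simply takes one more character. The "BO-" prefix, the 
default length of 8 and the class name are all made up for the example. 
Lengthening the prefix is one possible way of "using a new one" on 
collision, not the only one.

```python
import hashlib

class BOIdRegistry:
    """Hypothetical registry handing out short, nearly unique BO ids."""

    def __init__(self, length=8):
        self.length = length      # hex characters kept from the MD5 digest
        self.inchi_by_id = {}     # BO id -> InChI
        self.id_by_inchi = {}     # InChI -> BO id

    def assign(self, inchi):
        """Return the BO id for an InChI, creating one if needed."""
        if inchi in self.id_by_inchi:
            return self.id_by_inchi[inchi]
        digest = hashlib.md5(inchi.encode("utf-8")).hexdigest()
        # Try the short prefix first; lengthen it only if it is taken.
        for n in range(self.length, len(digest) + 1):
            candidate = "BO-" + digest[:n]
            if candidate not in self.inchi_by_id:
                self.inchi_by_id[candidate] = inchi
                self.id_by_inchi[inchi] = candidate
                return candidate
        raise RuntimeError("full MD5 digest already taken")

if __name__ == "__main__":
    registry = BOIdRegistry()
    print(registry.assign("InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"))
```

With 8 hex characters there are about 4 billion possible ids, so 
collisions should be rare at blogosphere scale, and the fallback costs 
only one extra character when they do happen.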

We'd have to have a bidirectional lookup for this identifier: InChI 
<==> BO. That could be done with RDF, and would be a fun exercise. If 
we get above 100,000 triples we will have succeeded anyway. There are 
people who are offering to host triple services for free. We could 
probably put it in our institutional repository for indexing chemistry theses.
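
A minimal sketch of the InChI <==> BO lookup as triples, in plain 
Python so it runs without an RDF library. The namespace and predicate 
URIs are placeholders under example.org and the BO id is invented. A 
real service would pick its own vocabulary and probably use a proper 
triple store. One triple per pair is enough, since it can be queried 
from either end.

```python
BO_NS = "http://example.org/bo/id/"              # placeholder namespace
HAS_INCHI = "http://example.org/bo/terms#inchi"  # placeholder predicate

def make_triple(bo_id, inchi):
    """One (subject, predicate, object) triple linking a BO id to an InChI."""
    return (BO_NS + bo_id, HAS_INCHI, inchi)

def to_ntriples(triples):
    """Serialise the triples as N-Triples lines, with InChIs as plain literals."""
    lines = []
    for s, p, o in sorted(triples):
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s> "%s" .' % (s, p, literal))
    return "\n".join(lines)

def inchis_for(triples, bo_id):
    """BO id -> InChI direction."""
    subject = BO_NS + bo_id
    return [o for s, p, o in triples if s == subject and p == HAS_INCHI]

def bo_ids_for(triples, inchi):
    """InChI -> BO id direction."""
    return [s[len(BO_NS):] for s, p, o in triples
            if p == HAS_INCHI and o == inchi]

if __name__ == "__main__":
    glycine = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"
    store = {make_triple("BO-0001", glycine)}    # invented id
    print(to_ntriples(store))
    print(inchis_for(store, "BO-0001"))
    print(bo_ids_for(store, glycine))
```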

Anyway that is a first shot...

P.



Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
+44-1223-763069 


_______________________________________________
Blueobelisk-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss
