On 11/01/2011 03:42, Geoffrey Hutchison wrote:
> Abe Heifets has been talking with me about an Open Babel-based retrosynthesis 
> and synthetic difficulty engine. (Abe, I hope that's a fair summary.) He has 
> been working some on the code itself, but has also been looking for an open 
> database of synthetic reactions for training.
>
> It seems like if we can enhance the ChemDraw CDX support to handle reactions 
> better, he can data-mine Google's chemical patents.
>
> Chris… how would the ChemDraw code handle files? For some files, you'd place 
> a call and not know if you get back an OBMol or OBReaction. How do you handle 
> this in CML?

There was a thread on reading reactions from ChemDraw files on the 
OpenBabel-Devel list, started by Jean Brefort on Oct 28, 2007. More 
recently, Rich Apodaca
http://depth-first.com/articles/2010/09/17/reading-and-translating-chemdraw-cdx-files-with-openbabel/
seems to have some ideas on this, but we don't know what they are 
yet. It is something I have been interested in, but I have been put 
off in the past by having neither a good set of example files nor 
the capability of writing ChemDraw files. The article above suggests 
how to get around this.

When reading ChemDraw input, this is one possible way: all molecules 
found would be converted to OBMols on the heap. When a reaction was 
recognized it would be output (via AddChemObject()) as an OBReaction. 
This contains shared_ptrs to its OBMols, meaning you don't have to 
worry about deleting them. Any molecules not part of a reaction would 
be output as OBMols and deleted in the normal way.
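
The ownership scheme described above could be sketched roughly as 
follows. This is only an illustration: Base, Mol, Reaction, Emitted and 
emitObjects() are hypothetical stand-ins, not the Open Babel classes or 
the real CDX reader, and the reactant/product indices are assumed to be 
known already.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for OBBase, OBMol and OBReaction.
// They are NOT the Open Babel classes; they only model ownership.
struct Base {
    virtual ~Base() = default;
};

struct Mol : Base {
    std::string title;
    explicit Mol(std::string t) : title(std::move(t)) {}
};

struct Reaction : Base {
    // Shared ownership: a molecule is freed when its last holder goes away.
    std::vector<std::shared_ptr<Mol>> reactants;
    std::vector<std::shared_ptr<Mol>> products;
};

// Assumed shape of what a reader hands on after parsing one drawing.
struct Emitted {
    std::unique_ptr<Reaction> reaction;       // emitted as a single object
    std::vector<std::unique_ptr<Mol>> loose;  // emitted one by one
};

// Every molecule is created on the heap. Those belonging to the reaction
// are held by it through shared_ptr; the rest are emitted on their own.
Emitted emitObjects(const std::vector<std::string>& titles,
                    const std::set<std::size_t>& reactantIdx,
                    const std::set<std::size_t>& productIdx)
{
    Emitted out;
    out.reaction = std::make_unique<Reaction>();
    for (std::size_t i = 0; i < titles.size(); ++i) {
        const bool isReactant = reactantIdx.count(i) != 0;
        const bool isProduct = productIdx.count(i) != 0;
        assert(!(isReactant && isProduct));  // simplification of this sketch
        if (isReactant)
            out.reaction->reactants.push_back(std::make_shared<Mol>(titles[i]));
        else if (isProduct)
            out.reaction->products.push_back(std::make_shared<Mol>(titles[i]));
        else
            out.loose.push_back(std::make_unique<Mol>(titles[i]));
    }
    return out;
}
```

Destroying the Reaction releases its molecules automatically, while 
the loose molecules are released by whoever receives them.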

This mixed output OBMol/OBReaction can be handled by CML and CMLR 
formats. (Because objects are passed between input and output as 
OBBase objects, an output format can use dynamic_cast to handle 
anything.) Reaction formats like RXN would see only the reactions. 
Most molecule formats would see all the molecules, including those in 
the reactions.
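
As a rough illustration of the dynamic_cast dispatch, here is a 
sketch with three kinds of writer. The classes and the write...() 
functions are hypothetical stand-ins, not Open Babel's format API; the 
returned strings are just placeholders for real output.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for OBBase, OBMol and OBReaction.
struct Base {
    virtual ~Base() = default;
};

struct Mol : Base {
    std::string title;
    explicit Mol(std::string t) : title(std::move(t)) {}
};

struct Reaction : Base {
    std::vector<std::shared_ptr<Mol>> reactants;
    std::vector<std::shared_ptr<Mol>> products;
};

// A CML-like writer: accepts either kind of object.
std::string writeMixed(const Base* obj)
{
    assert(obj != nullptr);
    if (const auto* rxn = dynamic_cast<const Reaction*>(obj))
        return "reaction: " + std::to_string(rxn->reactants.size()) +
               " -> " + std::to_string(rxn->products.size());
    if (const auto* mol = dynamic_cast<const Mol*>(obj))
        return "molecule: " + mol->title;
    return std::string();
}

// An RXN-like writer: sees only the reactions, skips everything else.
std::string writeReactionOnly(const Base* obj)
{
    assert(obj != nullptr);
    return dynamic_cast<const Reaction*>(obj) ? writeMixed(obj)
                                              : std::string();
}

// A molecule-format writer: sees every molecule, including those
// that arrive wrapped inside a reaction.
std::vector<std::string> writeMoleculesOnly(const Base* obj)
{
    assert(obj != nullptr);
    std::vector<std::string> titles;
    if (const auto* rxn = dynamic_cast<const Reaction*>(obj)) {
        for (const auto& m : rxn->reactants) titles.push_back(m->title);
        for (const auto& m : rxn->products) titles.push_back(m->title);
    } else if (const auto* mol = dynamic_cast<const Mol*>(obj)) {
        titles.push_back(mol->title);
    }
    return titles;
}
```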

I guess the biggest challenge is to recognize which molecules are the 
reactants, products or agents of a reaction, since CDX files are 
essentially drawings.
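
To make the difficulty concrete, here is one naive geometric 
heuristic, offered purely as an assumption and not as anything the 
CDX format or Open Babel provides: project the centre of each drawn 
molecule onto the reaction arrow and assign a role from where it 
falls. Point, Role and classifyByArrow() are hypothetical names, and 
real drawings (several arrows, curved arrows, text labels) would need 
much more than this.

```cpp
#include <cassert>

struct Point {
    double x;
    double y;
};

enum class Role { Reactant, Agent, Product, Unknown };

// Project the molecule's centre onto the arrow (tail -> head).
// t < 0  : before the tail  -> reactant
// t > 1  : past the head    -> product
// 0..1   : alongside arrow  -> agent (drawn above or below it)
Role classifyByArrow(Point center, Point tail, Point head)
{
    const double dx = head.x - tail.x;
    const double dy = head.y - tail.y;
    const double len2 = dx * dx + dy * dy;
    if (len2 == 0.0)
        return Role::Unknown;  // degenerate arrow, no direction

    const double t =
        ((center.x - tail.x) * dx + (center.y - tail.y) * dy) / len2;
    if (t < 0.0) return Role::Reactant;
    if (t > 1.0) return Role::Product;
    return Role::Agent;
}
```

Using a projection rather than a plain x comparison means the same 
test works for arrows drawn in any direction.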

Currently OBReaction represents only a single step of a reaction. The 
information from multistep schemes can still be recorded efficiently 
(molecules are held by reference), but an extension might be 
worthwhile. The CML schema includes multistep reactions, but OB doesn't.
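
A sketch of how a multistep scheme could be recorded as a sequence of 
single steps that share their intermediates by reference. Mol, Step, 
Scheme and chainSteps() are hypothetical; this is not an existing or 
proposed Open Babel interface, and it only handles a linear A -> B -> C 
chain with one molecule on each side.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a molecule.
struct Mol {
    std::string title;
    explicit Mol(std::string t) : title(std::move(t)) {}
};

// One single-step reaction, holding its molecules by reference.
struct Step {
    std::vector<std::shared_ptr<Mol>> reactants;
    std::vector<std::shared_ptr<Mol>> products;
};

using Scheme = std::vector<Step>;

// Turn a linear chain of molecules into consecutive single steps.
// Each intermediate is stored once and referenced from two steps:
// as the product of one and the reactant of the next.
Scheme chainSteps(const std::vector<std::string>& titles)
{
    std::vector<std::shared_ptr<Mol>> mols;
    for (const auto& t : titles)
        mols.push_back(std::make_shared<Mol>(t));

    Scheme scheme;
    for (std::size_t i = 0; i + 1 < mols.size(); ++i) {
        Step step;
        step.reactants.push_back(mols[i]);
        step.products.push_back(mols[i + 1]);
        scheme.push_back(std::move(step));
    }
    assert(titles.size() < 2 || scheme.size() == titles.size() - 1);
    return scheme;
}
```

Because the steps share pointers rather than copies, no molecule is 
duplicated however many steps refer to it.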

Chris

> In general, we'd just need to support some new tags in CDX, which isn't a 
> huge deal if I can wrap my brain around the calling convention. I'm not aware 
> of other formats where you try to read a file and might get back something 
> you don't expect (i.e., it's either an SD file or RXN file, but not both).
>
> Thanks,
> -Geoff
>
> Begin forwarded message:
>
>> Hi Geoff,
>>
>> When we skyped, I mentioned that I was looking for good reaction
>> databases.  I may have found one and I'd like your help to get at it.
>>
>> Google recently released 10 years (and 10 terabytes) of US patents for
>> free download [1].  The chemical patents come with CDX files and I'd
>> like to be able to pull out the reactions in a format that's more
>> amenable to further analysis, as per your and Wolf's discussion [2].
>> For my purposes, atom-mapping would be useful but capturing all of the
>> reaction conditions would probably not be necessary.
>>
>> How hard do you think it would be to extend OpenBabel to pull out
>> reactions from CDX files?  Here are some examples:
>> http://www.google.com/patents?id=SEAxAAAAEBAJ&zoom=4&pg=PA4#v=onepage&q&f=false
>> http://www.google.com/patents?id=XUQVAAAAEBAJ&zoom=4&pg=PA3#v=onepage&q&f=false
>> http://www.google.com/patents?id=oeIDAAAAEBAJ&zoom=4&pg=PA4#v=onepage&q&f=false
>> http://www.google.com/patents?id=t-wHAAAAEBAJ&zoom=4&pg=PA3#v=onepage&q&f=false
>> http://www.google.com/patents?id=60wAAAAAEBAJ&zoom=4&pg=PA9#v=onepage&q&f=false
>> These aren't the most straightforward set of reactions but, even if we
>> can extract 80% of reactions, that'd be valuable.
>>
>>
>> Also, right now it's easy to pull out a pile of molecules from the
>> patents.  Please let me know if you've got a use for this kind of data
>> and I can get it to you.
>>
>> Cheers,
>> Abe
>>
>> [1] 
>> http://googlepublicpolicy.blogspot.com/2010/06/free-download-10-terabytes-of-patents.html
>> [2] 
>> http://depth-first.com/articles/2010/09/17/reading-and-translating-chemdraw-cdx-files-with-openbabel/
>>
>>
>>
>> On Mon, Nov 22, 2010 at 10:24 PM, A. Heifets<abe-...@cs.toronto.edu>  wrote:
>>> Ok, see you at 11:30.
>>>
>>> On Mon, Nov 22, 2010 at 5:48 PM, Geoffrey Hutchison<geo...@pitt.edu>  wrote:
>>>> Abe,
>>>>
>>>> I completely forgot another meeting tomorrow until 11:30, and I have a 
>>>> follow-up at 12:30. If that's OK, I'll see you then. I'm "ghutchis."
>>>>
>>>> Thanks,
>>>> -Geoff
>>
>> --
>> A. Heifets
>> http://www.cs.toronto.edu/~aheifets/
>
> ---
> Prof. Geoffrey Hutchison
> Assistant Professor, Department of Chemistry
> University of Pittsburgh
> http://hutchison.chem.pitt.edu/
> Office: (412) 648-0492

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
