On 11/01/2011 03:42, Geoffrey Hutchison wrote:
> Abe Heifets has been talking with me about an Open Babel-based retrosynthesis
> and synthetic difficulty engine. (Abe, I hope that's a fair summary.) He has
> been working some on the code itself, but has also been looking for an open
> database of synthetic reactions for training.
>
> It seems like if we can enhance the ChemDraw CDX support to handle reactions
> better, he can data-mine Google's chemical patents.
>
> Chris… how would the ChemDraw code handle files? For some files, you'd place
> a call and not know if you get back an OBMol or OBReaction. How do you handle
> this in CML?
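
To make the "OBMol or OBReaction?" question concrete, here is a rough sketch of the dispatch idea using stand-in classes. Base, Mol, Reaction, describe() and moleculeTitles() are hypothetical names invented for this illustration. They are not the Open Babel API; they only mimic the shape of a polymorphic base object, a molecule, and a reaction that holds its molecules through shared_ptr. The consumer receives a base pointer and uses dynamic_cast to find out what it was given.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in types for illustration only. They are NOT the Open Babel classes;
// they just mimic the shape: a polymorphic base, a molecule, and a reaction.
struct Base {
    virtual ~Base() = default;  // polymorphic base, so dynamic_cast is usable
};

struct Mol : Base {
    std::string title;
    explicit Mol(std::string t) : title(std::move(t)) {}
};

struct Reaction : Base {
    // Shared ownership: the reaction keeps its molecules alive, so nobody
    // has to delete them by hand.
    std::vector<std::shared_ptr<Mol>> reactants;
    std::vector<std::shared_ptr<Mol>> products;
};

// Join molecule titles with " + ".
std::string joinTitles(const std::vector<std::shared_ptr<Mol>>& mols) {
    std::string out;
    for (std::size_t i = 0; i < mols.size(); ++i) {
        if (i > 0) out += " + ";
        out += mols[i]->title;
    }
    return out;
}

// A writer that accepts either kind of object (the mixed-output case).
std::string describe(const Base* obj) {
    if (const auto* rxn = dynamic_cast<const Reaction*>(obj))
        return "reaction: " + joinTitles(rxn->reactants) + " >> " +
               joinTitles(rxn->products);
    if (const auto* mol = dynamic_cast<const Mol*>(obj))
        return "molecule: " + mol->title;
    return "unsupported object";
}

// What a molecule-only writer would see: every molecule, including the
// ones that belong to a reaction.
std::vector<std::string> moleculeTitles(const Base* obj) {
    std::vector<std::string> titles;
    if (const auto* rxn = dynamic_cast<const Reaction*>(obj)) {
        for (const auto& m : rxn->reactants) titles.push_back(m->title);
        for (const auto& m : rxn->products) titles.push_back(m->title);
    } else if (const auto* mol = dynamic_cast<const Mol*>(obj)) {
        titles.push_back(mol->title);
    }
    return titles;
}
```

A caller would build Mol objects with std::make_shared, push them into a Reaction, and hand either kind of object to describe() as a Base pointer. A reaction-only writer would simply skip anything that does not cast to Reaction.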

There was a thread on reading reactions from ChemDraw files on the OpenBabel-Devel list, started by Jean Brefort on Oct 28 2007. More recently, Rich Apodaca (http://depth-first.com/articles/2010/09/17/reading-and-translating-chemdraw-cdx-files-with-openbabel/) seems to have some ideas on this, but we don't know what they are yet.

It is something I have been interested in, but I have been put off in the past by having neither a good set of example files nor the capability of writing ChemDraw files. The article above suggests how to get round this.

When reading ChemDraw input, this is a possible way: all molecules found would be converted to OBMols on the heap. When a reaction was recognized it would be output (via AddChemObject()) as an OBReaction. This contains shared_ptrs to its OBMols, meaning you don't have to worry about deleting them. Any molecules not part of a reaction would be output as OBMols and deleted in the normal way.

This mixed OBMol/OBReaction output can be handled by the CML and CMLR formats. (Because objects are passed between input and output as OBBase objects, an output format can use dynamic_cast to handle anything.) Reaction formats like RXN would see only the reactions. Most molecule formats would see all the molecules, including those in the reactions.

I guess the biggest challenge is to recognize which molecules are the reactants, products or agents of a reaction, since CDX files are essentially drawings.

Currently OBReaction represents only a single step of a reaction. The information from multistep schemes can still be recorded efficiently (molecules are held by reference), but an extension might be worthwhile. The CML schema includes multistep reactions, but OB doesn't.

Chris

> In general, we'd just need to support some new tags in CDX, which isn't a
> huge deal if I can wrap my brain around the calling convention.
> I'm not aware of other formats where you try to read a file and might get
> back something you don't expect (i.e., it's either an SD file or RXN file,
> but not both).
>
> Thanks,
> -Geoff
>
> Begin forwarded message:
>
>> Hi Geoff,
>>
>> When we skyped, I mentioned that I was looking for good reaction
>> databases. I may have found one and I'd like your help to get at it.
>>
>> Google recently released 10 years (and 10 terabytes) of US patents for
>> free download [1]. The chemical patents come with CDX files and I'd
>> like to be able to pull out the reactions in a format that's more
>> amenable to further analysis, as per your and Wolf's discussion [2].
>> For my purposes, atom-mapping would be useful but capturing all of the
>> reaction conditions would probably not be necessary.
>>
>> How hard do you think it would be to extend OpenBabel to pull out
>> reactions from CDX files? Here are some examples:
>> http://www.google.com/patents?id=SEAxAAAAEBAJ&zoom=4&pg=PA4#v=onepage&q&f=false
>> http://www.google.com/patents?id=XUQVAAAAEBAJ&zoom=4&pg=PA3#v=onepage&q&f=false
>> http://www.google.com/patents?id=oeIDAAAAEBAJ&zoom=4&pg=PA4#v=onepage&q&f=false
>> http://www.google.com/patents?id=t-wHAAAAEBAJ&zoom=4&pg=PA3#v=onepage&q&f=false
>> http://www.google.com/patents?id=60wAAAAAEBAJ&zoom=4&pg=PA9#v=onepage&q&f=false
>> These aren't the most straightforward set of reactions but, even if we
>> can extract 80% of reactions, that'd be valuable.
>>
>>
>> Also, right now it's easy to pull out a pile of molecules from the
>> patents. Please let me know if you've got a use for this kind of data
>> and I can get it to you.
>>
>> Cheers,
>> Abe
>>
>> [1]
>> http://googlepublicpolicy.blogspot.com/2010/06/free-download-10-terabytes-of-patents.html
>> [2]
>> http://depth-first.com/articles/2010/09/17/reading-and-translating-chemdraw-cdx-files-with-openbabel/
>>
>>
>>
>> On Mon, Nov 22, 2010 at 10:24 PM, A. Heifets<abe-...@cs.toronto.edu> wrote:
>>> Ok, see you at 11:30.
>>>
>>> On Mon, Nov 22, 2010 at 5:48 PM, Geoffrey Hutchison<geo...@pitt.edu> wrote:
>>>> Abe,
>>>>
>>>> I completely forgot another meeting tomorrow until 11:30, and I have a
>>>> follow-up at 12:30. If that's OK, I'll see you then. I'm "ghutchis."
>>>>
>>>> Thanks,
>>>> -Geoff
>>
>> --
>> A. Heifets
>> http://www.cs.toronto.edu/~aheifets/
>
> ---
> Prof. Geoffrey Hutchison
> Assistant Professor, Department of Chemistry
> University of Pittsburgh
> http://hutchison.chem.pitt.edu/
> Office: (412) 648-0492

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel