Abe Heifets has been talking with me about an Open Babel-based retrosynthesis and synthetic difficulty engine. (Abe, I hope that's a fair summary.) He has been working some on the code itself, but has also been looking for an open database of synthetic reactions for training.
It seems like if we can enhance the ChemDraw CDX support to handle reactions better, he can data-mine Google's chemical patents.

Chris… how would the ChemDraw code handle files? For some files, you'd place a call and not know if you get back an OBMol or OBReaction. How do you handle this in CML?

In general, we'd just need to support some new tags in CDX, which isn't a huge deal if I can wrap my brain around the calling convention. I'm not aware of other formats where you try to read a file and might get back something you don't expect (i.e., it's either an SD file or RXN file, but not both).

Thanks,
-Geoff

Begin forwarded message:

> Hi Geoff,
>
> When we skyped, I mentioned that I was looking for good reaction
> databases. I may have found one and I'd like your help to get at it.
>
> Google recently released 10 years (and 10 terabytes) of US patents for
> free download [1]. The chemical patents come with CDX files and I'd
> like to be able to pull out the reactions in a format that's more
> amenable to further analysis, as per your and Wolf's discussion [2].
> For my purposes, atom-mapping would be useful but capturing all of the
> reaction conditions would probably not be necessary.
>
> How hard do you think it would be to extend OpenBabel to pull out
> reactions from CDX files? Here are some examples:
> http://www.google.com/patents?id=SEAxAAAAEBAJ&zoom=4&pg=PA4#v=onepage&q&f=false
> http://www.google.com/patents?id=XUQVAAAAEBAJ&zoom=4&pg=PA3#v=onepage&q&f=false
> http://www.google.com/patents?id=oeIDAAAAEBAJ&zoom=4&pg=PA4#v=onepage&q&f=false
> http://www.google.com/patents?id=t-wHAAAAEBAJ&zoom=4&pg=PA3#v=onepage&q&f=false
> http://www.google.com/patents?id=60wAAAAAEBAJ&zoom=4&pg=PA9#v=onepage&q&f=false
> These aren't the most straightforward set of reactions but, even if we
> can extract 80% of reactions, that'd be valuable.
>
> Also, right now it's easy to pull out a pile of molecules from the
> patents.
> Please let me know if you've got a use for this kind of data
> and I can get it to you.
>
> Cheers,
> Abe
>
> [1] http://googlepublicpolicy.blogspot.com/2010/06/free-download-10-terabytes-of-patents.html
> [2] http://depth-first.com/articles/2010/09/17/reading-and-translating-chemdraw-cdx-files-with-openbabel/
>
> On Mon, Nov 22, 2010 at 10:24 PM, A. Heifets <abe-...@cs.toronto.edu> wrote:
>> Ok, see you at 11:30.
>>
>> On Mon, Nov 22, 2010 at 5:48 PM, Geoffrey Hutchison <geo...@pitt.edu> wrote:
>>> Abe,
>>>
>>> I completely forgot another meeting tomorrow until 11:30, and I have a
>>> follow-up at 12:30. If that's OK, I'll see you then. I'm "ghutchis."
>>>
>>> Thanks,
>>> -Geoff
>
> --
> A. Heifets
> http://www.cs.toronto.edu/~aheifets/

---
Prof. Geoffrey Hutchison
Assistant Professor, Department of Chemistry
University of Pittsburgh
http://hutchison.chem.pitt.edu/
Office: (412) 648-0492

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
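
As a rough illustration of the calling-convention question raised above (a read call that may hand back either a molecule or a reaction), here is a minimal Python sketch. It does not use Open Babel's API: `Molecule`, `Reaction`, `read_object`, and `describe` are hypothetical stand-ins (for OBMol, OBReaction, a CDX reader, and calling code), and the "arrow" flag is only an assumed signal that a record holds a reaction scheme. The point is just that the caller inspects the returned type and dispatches on it.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Molecule:
    """Hypothetical stand-in for OBMol."""
    name: str


@dataclass
class Reaction:
    """Hypothetical stand-in for OBReaction."""
    reactants: List[Molecule] = field(default_factory=list)
    products: List[Molecule] = field(default_factory=list)


# A read may yield either kind of object.
ChemObject = Union[Molecule, Reaction]


def read_object(record: dict) -> ChemObject:
    """Hypothetical reader: return a Reaction if the record has an arrow,
    otherwise a Molecule. `record` is a toy dict, not real CDX data."""
    if record.get("arrow"):
        return Reaction(
            reactants=[Molecule(n) for n in record["left"]],
            products=[Molecule(n) for n in record["right"]],
        )
    return Molecule(record["name"])


def describe(obj: ChemObject) -> str:
    """Caller-side dispatch on whatever the reader returned."""
    if isinstance(obj, Reaction):
        left = " + ".join(m.name for m in obj.reactants)
        right = " + ".join(m.name for m in obj.products)
        return f"reaction: {left} -> {right}"
    return f"molecule: {obj.name}"
```

With this shape, code that only wants reactions can skip anything that is not a `Reaction`, which is roughly what mining reaction schemes out of mixed CDX files would need.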