Hello Egon (and other users),

Thanks for the reply!

As to your specific points: my sources (for now) are MDL V2000 molfiles obtained from KEGG. These contain 2D coordinates and stereo markers in the connection table, namely none, up, down or wiggly (which I guess you know, since you are listed as an author on the CDK MDLV2000Reader).

By "defined" I mean "two flat bonds, one up bond and one down bond are specified for that center" (an implicit hydrogen counts as 'opposite to whatever other bond was given'). I also count wiggly bonds as 'defined', because they make explicit that a stereocenter is present, even though its configuration is unknown. Everything else is undefined.

I guess my usage coincides with the synonyms ambiguous/unambiguous. Or, in the context of the blog post you provided, 'defined' means both 'known absolute' and 'unknown absolute', whereas 'undefined' maps to 'unspecified'. (Interesting article, by the way; I hadn't even considered some of the intermediate terms. Thanks for the pointer!)
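To make the definition above concrete, here is a minimal sketch in plain Python (no CDK) that reads the bond block of a V2000 molfile and reports which atoms carry an up/down marker ('known') or a wiggly marker ('unknown'). It relies on the V2000 bond-stereo codes for single bonds (1 = up, 4 = either, 6 = down) and on the convention that the marker belongs to the first atom of the bond line. The function names and the example connection table are hypothetical, and the check is deliberately simpler than my full definition: it does not verify the "two flat bonds" part, and atoms without any marker are not reported at all, since telling 'undefined' apart from 'not a stereocenter' needs real stereocenter perception.

```python
from collections import defaultdict

# Bond-stereo codes for single bonds in a V2000 bond block.
UP, EITHER, DOWN = 1, 4, 6


def parse_bonds(molfile):
    """Return (first_atom, second_atom, order, stereo) for every bond line."""
    lines = molfile.splitlines()
    counts = lines[3]                       # counts line follows 3 header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    start = 4 + n_atoms                     # bond block follows the atom block
    bonds = []
    for line in lines[start:start + n_bonds]:
        stereo = int(line[9:12].strip() or "0")
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9]), stereo))
    return bonds


def classify_marked_centers(molfile):
    """Map atom number -> 'known' or 'unknown' for atoms carrying a stereo bond."""
    marks = defaultdict(set)
    for first, _second, order, stereo in parse_bonds(molfile):
        if order == 1 and stereo in (UP, EITHER, DOWN):
            marks[first].add(stereo)        # the marker belongs to the first atom
    return {atom: "unknown" if EITHER in flags else "known"
            for atom, flags in sorted(marks.items())}


# Synthetic connection table, only meant to exercise the parser.
# Coordinates are not used by this sketch, so all atoms share one position.
ATOM_LINE = "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0"
EXAMPLE = "\n".join([
    "synthetic example", "", "",
    "  6  5  0  0  0  0  0  0  0  0999 V2000",
    *([ATOM_LINE] * 6),
    "  1  2  1  1",   # up bond starting at atom 1
    "  1  3  1  6",   # down bond starting at atom 1
    "  1  4  1  0",   # flat
    "  1  5  1  0",   # flat
    "  5  6  1  4",   # wiggly bond starting at atom 5
    "M  END",
])

print(classify_marked_centers(EXAMPLE))  # {1: 'known', 5: 'unknown'}
```

Counting the two classes per molecule would already give the numbers I want to compare between the Biometa and KEGG versions of a compound.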
Some use-cases:

Vitamin K1 epoxide:
in Biometa: http://cheminf.cmbi.ru.nl/cgi-bin/biometa/biometa.py?context=molecules&str_disp=jme&mol_id=MC001223
in KEGG: http://www.genome.jp/dbget-bin/www_bget?cpd:C05849
Here I have a (partially) defined stereochemistry in my own database, where KEGG has none. Of course I'd rather keep the one with as much stereo information as possible. The truly interesting stereocenters at the epoxide are unfortunately unknown in both. In this case I would like to keep the Biometa one, as it has fewer ambiguous stereocenters.

Okadaic acid:
in Biometa: http://cheminf.cmbi.ru.nl/cgi-bin/biometa/biometa.py?context=molecules&str_disp=jme&mol_id=MC001756
in KEGG: http://www.genome.jp/dbget-bin/www_bget?compound+C01945
In this case, stereochemistry is equally well defined in both databases (they both miss the implicit hydrogen in the two rightmost 6-rings with O-bridge), but all the other stereochemistry is very well defined.

In both cases I would like to identify the unspecified/ambiguous stereocenters as such (so that I can label them with an asterisk or something). This is what I mean by "detecting undefined stereocenters with the CIPtool".

Regarding double bonds: the KEGG input files do indeed provide 2D coordinates. In which package is the tool you mentioned located? And how does it handle the different ways to specify 'unknown E/Z'? So far I know of using wavy bonds and/or drawing a substituent at 180 degrees to the double bond. (There are probably more that I haven't seen yet.)

In summary: I want to be able to distinguish between completely and incompletely specified sp3 stereocenters and double bonds. At the moment purely for counting purposes.

I hope this clarifies my intentions enough. If not, please continue to ask ad infinitum until you understand me; people not taking the time to fully specify what they mean is what caused the whole stereo-mess in the first place ;-). I'll gladly await your CIPtool patch.
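For the double bonds, the kind of check I have in mind looks roughly like the sketch below (plain Python again, not the CDK tool you mentioned). It flags a double bond as 'unknown E/Z' when the bond line carries stereo code 3 (the V2000 "cis or trans" flag), when a neighbouring single bond is wiggly (code 4), or when a substituent is drawn nearly collinear with the double bond. The function names, the reason labels and the 5 degree tolerance are my own assumptions, and the wiggly-neighbour test is only a heuristic: such a bond could just as well be marking an adjacent sp3 center.

```python
import math


def parse_v2000(molfile):
    """Return ({atom: (x, y)}, [(first, second, order, stereo), ...])."""
    lines = molfile.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    coords = {i: (float(line[0:10]), float(line[10:20]))
              for i, line in enumerate(lines[4:4 + n_atoms], start=1)}
    bonds = [(int(l[0:3]), int(l[3:6]), int(l[6:9]), int(l[9:12].strip() or "0"))
             for l in lines[4 + n_atoms:4 + n_atoms + n_bonds]]
    return coords, bonds


def angle_deg(a, b, c):
    """Angle at point b between points a and c, in degrees."""
    v1, v2 = (a[0] - b[0], a[1] - b[1]), (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0                      # overlapping atoms: no usable angle
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def unknown_double_bonds(molfile, tolerance=5.0):
    """Map (first, second) of each suspicious double bond to a list of reasons."""
    coords, bonds = parse_v2000(molfile)
    flagged = {}
    for a, b, order, stereo in bonds:
        if order != 2:
            continue
        reasons = set()
        if stereo == 3:
            reasons.add("either-flag")
        for x, y, o, s in bonds:
            if o != 1 or not ({x, y} & {a, b}):
                continue                # only single bonds touching this double bond
            if s == 4:
                reasons.add("wavy-neighbour")
            centre = x if x in (a, b) else y
            subst = y if centre == x else x
            other = b if centre == a else a
            if angle_deg(coords[subst], coords[centre], coords[other]) >= 180 - tolerance:
                reasons.add("linear-substituent")
        if reasons:
            flagged[(a, b)] = sorted(reasons)
    return flagged


# Synthetic four-carbon fragment: atom 4 is drawn in line with the 1=2 bond.
TAIL = " C   0  0  0  0  0  0  0  0  0  0  0  0"
EXAMPLE = "\n".join([
    "synthetic example", "", "",
    "  4  3  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000" + TAIL,
    "    1.0000    0.0000    0.0000" + TAIL,
    "   -0.5000    0.8660    0.0000" + TAIL,
    "    2.0000    0.0000    0.0000" + TAIL,
    "  1  2  2  0",
    "  1  3  1  0",
    "  2  4  1  0",
    "M  END",
])

print(unknown_double_bonds(EXAMPLE))  # {(1, 2): ['linear-substituent']}
```

Any double bond that is not flagged would still need a proper E/Z perception step from the 2D coordinates; this sketch only counts the ones that are explicitly or implicitly marked as unknown.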
My background is a study of molecular life sciences, with chemistry and computer programming as hobbies. So while I'll probably be able to understand most chemistry and code you throw at me, writing complex stuff like CIP rule interpreters is taxing for me. You're welcome to ask me for tests, however!

Kind regards!
Jules

On 14 June 2010 11:30, Egon Willighagen <egon.willigha...@gmail.com> wrote:
> Dear Jules,
>
> On Fri, Jun 11, 2010 at 12:56 PM, Jules Kerssemakers
> <j.kerssemak...@cmbi.ru.nl> wrote:
>> Hi, I'm Jules Kerssemakers, a recently started bioinformatics PhD
>> student at the group of Gert Vriend (CMBI, Netherlands).
>
> I will email you and Gert shortly to try to set up a meeting at the
> end of July, if suitable for all our agendas.
>
>> I work on the BioMeta database (http://biometa.cmbi.ru.nl) for
>> metabolites and metabolism, which I took over from my predecessor,
>> Martin Ott.
>
> Cool!
>
>> Primary item on the agenda (before I start the actual science-y work)
>> is an update of the information contained in the database (it's mostly
>> based on the 2005 version of the KEGG database)
>
> Sounds like a good way to start.
>
>> In an effort to automate this update procedure, I discovered the CDK.
>> I can see it's a very powerful toolkit, but I'm having some trouble
>> navigating the feature-set.
>> To compare the molecules-to-update, I'm interested in the amount of
>> defined/undefined stereocenters and defined/undefined double bond
>> configurations.
>
> Depending on the input. I have indeed recently been working on
> tetrahedral stereochemistry, but accurate identification of stereo
> centers is non-trivial. But, at least if the input is right, the CDK
> can now assign absolute stereochemistry, e.g. as R,S using the CIP
> rules.
>
>> Does the CDK have a way to calculate these properties?
>>
>> I already found EgonW's blog about the CIPTool,
>> (http://chem-bla-ics.blogspot.com/2010/04/cip-rules-for-stereochemistry.html),
>> but I haven't been able to find it in CDK v1.2.5 nor in v1.3.5.
>
> The code to calculate the R,S stereochemistry is not yet in 1.3.5, but
> the foundation is. There is some final testing to be done regarding
> the CIP code, after which I will prepare a patch against the 1.3
> series.
>
>> I'm also unsure if the CIPtool would let me detect undefined stereocenters.
>
> The CIPTool defines the stereochemistry of a stereocenter.
>
>> So, summing up:
>> -Can the CDK count defined/undefined stereocenters
>
> Depends on the exact context. What is your input? What do you mean
> exactly with defined/undefined? That is, can you put this in the
> context of, for example, this blog post:
>
> http://cactus.nci.nih.gov/blog/?p=679&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+chemicalStructureBlog+%28%2Fchemical%2Fstructure+Blog%29&utm_content=FriendFeed+Bot
>
>> and/or double bonds,
>
> The CDK has an algorithm to define stereochemistry from 2D
> coordinates, so if your input has that information... regarding that,
> it should also have code to define stereochemistry based on wedge bond
> information...
>
>> or do I need to do some (heavy) programming myself?
>
> I do not know enough about your particular use case to decide how much
> programming would be involved, and what of your needs is already
> available...
>
> Egon
>
> --
> Post-doc @ Uppsala University
> Proteochemometrics / Bioclipse Group of Prof. Jarl Wikberg
> Homepage: http://egonw.github.com/
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: http://www.citeulike.org/user/egonw/tag/papers
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user