Dear all,

I've been making some changes to the SMILES canonicalization code. My original intent was to get the code to efficiently and correctly generate SMILES for fragments of molecules (demo of this below). Along the way I realized that I could make the canonicalization faster and, I think, more robust with a small algorithm change (I can describe this at some point if anyone is interested).

Once I'd made the change, I needed some way to test the new code. The RDKit has a lot of tests, but this kind of change calls for a more extensive torture test. I wanted a set of molecules with some structural complexity and a pretty high density of stereochemistry (this is always where the problems are going to come from), so I chose the ZINC ZNP subset (http://zinc.docking.org/subsets/znp). This is a nice set of ~200K molecules.

The test I devised was the following:
1) Read a molecule from the SDF
2) Generate canonical SMILES csmi
3) Parse csmi to give a new molecule
4) Generate a new canonical SMILES and make sure it matches csmi
5) Pick 5 random atoms in the molecule and, for each one:
   5a) generate a non-canonical SMILES rooted at that atom
   5b) parse that non-canonical SMILES to give a new molecule
   5c) generate a new canonical SMILES from that and make sure it matches csmi

The code currently in the MolFragmentCanon_22May2012 branch passes all RDKit tests as well as the above torture test for the ZINC ZNP subset. I plan to merge it back onto the trunk in the next few days.

If anyone has recommendations for alternate test methodologies or test sets, please let me know. These tests aren't exactly super fast, so I'd like to avoid something like "just run the {pubchem, emolecules, full ZINC} set", but if people are convinced that's necessary, I can set it up and run it.

-greg

For those who are interested, here's a demo of the new MolFragmentToSmiles function:

In [2]: m = Chem.MolFromSmiles(r'CC(C)C(=O)/C=C1Nc2ccccc2NC\1=C\C(=O)C(C)C')

In [3]: Chem.MolFragmentToSmiles?
Type:           function
Base Class:     <type 'builtin_function_or_method'>
String Form:    <Boost.Python.function object at 0x2fce000>
Namespace:      Interactive
Docstring:
    MolFragmentToSmiles( (Mol)mol, (object)atomsToUse [, (object)bondsToUse=0 [, (object)atomSymbols=0 [, (object)bondSymbols=0 [, (bool)isomericSmiles=False [, (bool)kekuleSmiles=False [, (int)rootedAtAtom=-1 [, (bool)canonical=True [, (bool)allBondsExplicit=False]]]]]]]]) -> str :
        Returns the canonical SMILES string for a fragment of a molecule

        ARGUMENTS:

          - mol: the molecule
          - atomsToUse : a list of atoms to include in the fragment
          - bondsToUse : (optional) a list of bonds to include in the fragment
                         if not provided, all bonds between the atoms provided
                         will be included.
          - atomSymbols : (optional) a list with the symbols to use for the atoms
                          in the SMILES. This should have be mol.GetNumAtoms() long.
          - bondSymbols : (optional) a list with the symbols to use for the bonds
                          in the SMILES. This should have be mol.GetNumBonds() long.
          - isomericSmiles: (optional) include information about stereochemistry in
                            the SMILES. Defaults to false.
          - kekuleSmiles: (optional) use the Kekule form (no aromatic bonds) in
                          the SMILES. Defaults to false.
          - rootedAtAtom: (optional) if non-negative, this forces the SMILES
                          to start at a particular atom. Defaults to -1.
          - canonical: (optional) if false no attempt will be made to canonicalize
                       the molecule. Defaults to true.
          - allBondsExplicit: (optional) if true, all bond orders will be explicitly indicated
                              in the output SMILES. Defaults to false.

        RETURNS:

          a string

        C++ signature :
            std::string MolFragmentToSmiles(RDKit::ROMol,boost::python::api::object [,boost::python::api::object=0 [,boost::python::api::object=0 [,boost::python::api::object=0 [,bool=False [,bool=False [,int=-1 [,bool=True [,bool=False]]]]]]]])

In [4]: Chem.MolFragmentToSmiles(m,(0,1,2,3,7,8,9))
Out[4]: 'ccN.CC(C)C'
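
For anyone who wants to try the same kind of check on their own set, here is a minimal sketch of the round-trip torture test described above (steps 2-5). It is not the script used for the ZNP run. The helper names (round_trip_failures, rdkit_callables, run_on_sdf) are made up for illustration. Picking the random atoms without replacement and seeding the generator are assumptions, and the keyword arguments passed to Chem.MolToSmiles are assumed to mirror those listed in the MolFragmentToSmiles docstring (isomericSmiles, rootedAtAtom, canonical). The check itself only depends on four callables, so it can be tried out without a toolkit installed.

```python
import random


def round_trip_failures(mol, canonical, rooted, parse, num_atoms,
                        n_random=5, rng=None):
    """Return a list of messages describing round-trip mismatches for one molecule."""
    rng = rng or random.Random()

    def recanonicalize(smiles):
        # Parse a SMILES and canonicalize the result; None if parsing failed.
        new_mol = parse(smiles)
        return None if new_mol is None else canonical(new_mol)

    failures = []
    csmi = canonical(mol)                       # step 2
    if recanonicalize(csmi) != csmi:            # steps 3 and 4
        failures.append(f"canonical SMILES not stable: {csmi}")

    n = num_atoms(mol)
    # Step 5: pick up to n_random distinct atoms (assumption: no replacement).
    for idx in rng.sample(range(n), min(n_random, n)):
        smi = rooted(mol, idx)                  # step 5a
        if recanonicalize(smi) != csmi:         # steps 5b and 5c
            failures.append(f"rooted at atom {idx}: {smi} does not give {csmi}")
    return failures


def rdkit_callables():
    """Wire the check to RDKit. The import is deferred so the harness loads without it."""
    from rdkit import Chem

    def canonical(mol):
        return Chem.MolToSmiles(mol, isomericSmiles=True)

    def rooted(mol, idx):
        # Non-canonical SMILES forced to start at atom idx.
        return Chem.MolToSmiles(mol, isomericSmiles=True,
                                rootedAtAtom=idx, canonical=False)

    def num_atoms(mol):
        return mol.GetNumAtoms()

    return dict(canonical=canonical, rooted=rooted,
                parse=Chem.MolFromSmiles, num_atoms=num_atoms)


def run_on_sdf(path, seed=0):
    """Run the check on every readable record of an SD file and print mismatches."""
    from rdkit import Chem

    callables = rdkit_callables()
    rng = random.Random(seed)
    for i, mol in enumerate(Chem.SDMolSupplier(path)):   # step 1
        if mol is None:                                  # unreadable record
            continue
        for message in round_trip_failures(mol, rng=rng, **callables):
            print(f"record {i}: {message}")
```

Calling run_on_sdf with the path to a local SD file prints one line per mismatch and nothing if every molecule survives the round trip. Returning messages rather than asserting lets a long run continue past the first failure.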