Dear all,

I've been making some changes to the SMILES canonicalization code. My
original intent was to get the code to efficiently and correctly
generate SMILES for fragments of molecules (demo of this below). Along
the way I realized that I could make the canonicalization faster and,
I think, more robust with a small algorithm change (I can describe
this at some point if anyone is interested).

Once I'd made the change, I needed some way to test the new code. The
RDKit has a lot of tests, but this kind of change calls for a more
extensive torture test. I wanted a set of molecules with some
structural complexity and a pretty high density of stereochemistry
(this is always where the problems come up), so I chose the
ZINC ZNP subset (http://zinc.docking.org/subsets/znp). This is a nice
set of ~200K molecules.

The test I devised was the following:

1) Read a molecule from the SDF file
2) Generate the canonical SMILES, csmi
3) Parse csmi to give a new molecule
4) Generate a new canonical SMILES and make sure it matches csmi
5) Pick 5 random atoms in the molecule and, for each one:
    5a) Generate a non-canonical SMILES rooted at that atom
    5b) Parse that non-canonical SMILES to give a new molecule
    5c) Generate a new canonical SMILES from that and make sure it matches csmi
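
For anyone who wants to try something similar, here is a rough sketch of
steps 2-5 for a single molecule. This is not the script I actually ran. The
names roundtrip_failures and rdkit_adapters are made up for illustration, and
the toolkit calls are passed in as callables so the checking logic can be
exercised without RDKit installed. The adapter assumes Chem.MolToSmiles
accepts the isomericSmiles, canonical and rootedAtAtom keywords.

```python
import random


def roundtrip_failures(mol, to_smiles, from_smiles, num_atoms,
                       n_random=5, rng=None):
    """Run steps 2-5 on one molecule and return a list of mismatch messages.

    Hypothetical interface (not part of RDKit):
      to_smiles(mol, canonical, rooted_at) -> str   (rooted_at=-1: no root)
      from_smiles(smiles) -> molecule
      num_atoms(mol) -> int
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so any failure is reproducible
    failures = []

    # Steps 2-4: the canonical SMILES must survive a parse/regenerate cycle.
    csmi = to_smiles(mol, True, -1)
    again = to_smiles(from_smiles(csmi), True, -1)
    if again != csmi:
        failures.append("round trip: %s -> %s" % (csmi, again))

    # Step 5: a non-canonical SMILES rooted at a random atom must
    # canonicalize back to csmi.
    n = num_atoms(mol)
    for idx in rng.sample(range(n), min(n_random, n)):
        rooted = to_smiles(mol, False, idx)                  # 5a
        recanon = to_smiles(from_smiles(rooted), True, -1)   # 5b, 5c
        if recanon != csmi:
            failures.append("root %d: %s -> %s (expected %s)"
                            % (idx, rooted, recanon, csmi))
    return failures


def rdkit_adapters():
    """Assumed RDKit bindings for the three callables used above."""
    from rdkit import Chem  # imported lazily; only needed for real molecules

    def to_smiles(mol, canonical, rooted_at):
        return Chem.MolToSmiles(mol, isomericSmiles=True,
                                canonical=canonical, rootedAtAtom=rooted_at)

    return to_smiles, Chem.MolFromSmiles, lambda m: m.GetNumAtoms()
```

Step 1 would then be a loop over something like Chem.SDMolSupplier,
skipping entries that fail to parse and calling
roundtrip_failures(mol, *rdkit_adapters()) on the rest. An empty list
means the molecule passed.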

The code currently in the MolFragmentCanon_22May2012 branch passes
all RDKit tests as well as the above torture test for the ZINC ZNP
subset. I plan to merge it back onto the trunk in the next few days.

If anyone has recommendations for alternate test methodologies or test
sets, please let me know. These tests aren't exactly super fast, so
I'd like to avoid something like "just run the {pubchem, emolecules,
full ZINC} set", but if people are convinced that's necessary, I can
set it up and run it.

-greg

For those who are interested, here's a demo of the new
MolFragmentToSmiles function:

In [2]: m = Chem.MolFromSmiles(r'CC(C)C(=O)/C=C1Nc2ccccc2NC\1=C\C(=O)C(C)C')

In [3]: Chem.MolFragmentToSmiles?
Type:       function
Base Class: <type 'builtin_function_or_method'>
String Form:<Boost.Python.function object at 0x2fce000>
Namespace:  Interactive
Docstring:
MolFragmentToSmiles( (Mol)mol, (object)atomsToUse [, (object)bondsToUse=0 [, (object)atomSymbols=0 [, (object)bondSymbols=0 [, (bool)isomericSmiles=False [, (bool)kekuleSmiles=False [, (int)rootedAtAtom=-1 [, (bool)canonical=True [, (bool)allBondsExplicit=False]]]]]]]]) -> str :
    Returns the canonical SMILES string for a fragment of a molecule
      ARGUMENTS:

        - mol: the molecule
        - atomsToUse : a list of atoms to include in the fragment
        - bondsToUse : (optional) a list of bonds to include in the fragment
                       if not provided, all bonds between the atoms provided
                       will be included.
        - atomSymbols : (optional) a list with the symbols to use for the atoms
                        in the SMILES. This should have be mol.GetNumAtoms() long.
        - bondSymbols : (optional) a list with the symbols to use for the bonds
                        in the SMILES. This should have be mol.GetNumBonds() long.
        - isomericSmiles: (optional) include information about stereochemistry in
          the SMILES.  Defaults to false.
        - kekuleSmiles: (optional) use the Kekule form (no aromatic bonds) in
          the SMILES.  Defaults to false.
        - rootedAtAtom: (optional) if non-negative, this forces the SMILES
          to start at a particular atom. Defaults to -1.
        - canonical: (optional) if false no attempt will be made to canonicalize
          the molecule. Defaults to true.
        - allBondsExplicit: (optional) if true, all bond orders will be explicitly indicated
          in the output SMILES. Defaults to false.

      RETURNS:

        a string



    C++ signature :
        std::string MolFragmentToSmiles(RDKit::ROMol,boost::python::api::object [,boost::python::api::object=0 [,boost::python::api::object=0 [,boost::python::api::object=0 [,bool=False [,bool=False [,int=-1 [,bool=True [,bool=False]]]]]]]])

In [4]: Chem.MolFragmentToSmiles(m,(0,1,2,3,7,8,9))
Out[4]: 'ccN.CC(C)C'
