You might find this link useful - http://www.rdkit.org/docs/GettingStartedInPython.html#chemical-transformations

However, the issue in your case is the SMARTS definitions. If one SMARTS pattern completely covers another, it is difficult to tell whether an overlapping match is an artifact or not. I think it would be reasonable to revise the SMARTS to avoid such overlaps, or to create a list of rules (maybe hierarchical) that defines which overlaps are valid and which are not.
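
As a rough sketch of what such a rule list could look like (not a tested implementation): a table names, for each parent group, the groups whose matches it absorbs, and a match is dropped only when a listed parent has a match containing all of its atoms. The function name `apply_overlap_rules`, the `ABSORBS` table, and the group labels attached to atoms (9,) and (7, 8) are assumptions made for illustration. The atom indices are the ones reported for 4-methylsalicylic acid in the quoted message below, with ring atoms 1 and 4 read off the SMILES.

```python
# Hypothetical rule table: parent group -> groups whose matches it absorbs.
ABSORBS = {"-COOH": {"-OH (alcohol)", ">C=O"}}


def apply_overlap_rules(matches, absorbs):
    """Keep a match unless a declared parent group has a match
    that contains all of its atoms."""
    kept = {}
    for name, group_matches in matches.items():
        # Parents are the groups that list `name` among their children.
        parents = [p for p, children in absorbs.items() if name in children]
        parent_matches = [set(pm) for p in parents for pm in matches.get(p, ())]
        kept[name] = [
            m for m in group_matches
            if not any(set(m) <= pm for pm in parent_matches)
        ]
    return kept


# Match tuples as they would come from m.GetSubstructMatches(pattern).
# The group labels for (9,) and (7, 8) are assumed, not taken from the thread.
matches = {
    "-COOH": [(7, 8, 9)],
    "-OH (alcohol)": [(9,)],
    ">C=O": [(7, 8)],
    "-OH (phenol)": [(10, 3)],
    "=C< (ring)": [(1,), (3,), (4,)],
}

kept = apply_overlap_rules(matches, ABSORBS)
counts = {name: len(ms) for name, ms in kept.items()}
print(counts)
```

Because no rule declares "-OH (phenol)" as a parent of "=C< (ring)", the ring carbon at position 3 is still counted, while the matches nested inside -COOH are dropped.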


Pavel.


On 03/08/2017 06:32 PM, Chenyang Shi wrote:
Dear Hongbin,

I tried your method on a molecule, 4-methylsalicylic acid (CC1=CC(=C(C=C1)C(=O)O)O). I looped through all groups defined in the Joback method (using SMARTS) and used m.GetSubstructMatches to print out the atom positions of every match. The result is summarized in the table.

We can see there are duplicated counts coming from the COOH group. As suggested by Hongbin, we can remove duplicated atoms by looking at their positions: in this case, ((9,),), ((7,8),), ((7,),), and ((8,),) are subsets of ((7,8,9),) from -COOH, so we can indeed get rid of these duplicates. However, I also noticed that atom (3,) from the =C< (ring) group is also part of the -OH (phenol) match ((10,3),). If we apply the same algorithm to remove duplicates, the =C< (ring) group will be counted only twice instead of three times.
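
One possible way around the =C< (ring) undercount, sketched here under stated assumptions: let each group claim only the atoms it "owns" and treat the rest of its match as context. In this sketch the phenol -OH pattern owns only its oxygen (the first atom of the match), so ring carbon 3 stays available. The function name `count_groups`, the `n_owned` convention, and the "-OH (alcohol)" label for match (9,) are hypothetical. Ring positions 1 and 4 are read off the SMILES; the other indices are those quoted above.

```python
def count_groups(group_matches):
    """Count groups in priority order (most specific first).

    Each entry is (name, matches, n_owned). Only the first n_owned atoms
    of a match are claimed; the remaining atoms are context only.
    """
    claimed = set()
    counts = {}
    for name, matches, n_owned in group_matches:
        counts[name] = 0
        for match in matches:
            owned = match[:n_owned]
            if claimed.isdisjoint(owned):
                claimed.update(owned)
                counts[name] += 1
    return counts


# Match tuples as they would come from m.GetSubstructMatches(pattern).
group_matches = [
    ("-COOH", ((7, 8, 9),), 3),
    ("-OH (phenol)", ((10, 3),), 1),   # oxygen 10 is owned, carbon 3 is context
    ("-OH (alcohol)", ((9,),), 1),     # assumed label for the (9,) match
    ("=C< (ring)", ((1,), (3,), (4,)), 1),
]

counts = count_groups(group_matches)
print(counts)
```

The same effect could probably be reached by rewriting the phenol pattern so that the ring carbon sits inside a recursive SMARTS environment and the match contains only the oxygen.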

Greg, you mentioned that, as an alternative, I can delete a substructure using the chemical reaction method. It would be greatly appreciated if you could show me (or point me to) a simple code example, perhaps on a simple molecule. I find myself at a loss when browsing the manual, and I would like to try that direction as well.

Thanks,
Chenyang


Inline image 1


On Mon, Mar 6, 2017 at 1:52 AM, Greg Landrum <greg.land...@gmail.com <mailto:greg.land...@gmail.com>> wrote:

    The solution that Hongbin proposes to the double-counting problem
    is a good one. Just be sure to sort your substructure queries in
    the right order so that the more complex ones come first.

    Another thing you might think about is making your queries more
    specific. For example, as you pointed out "[OH]" is very general
    and matches parts of carboxylic acids and a number of other
    functional groups. The RDKit has a set of fairly well tested
    (though certainly not perfect) functional group definitions in
    $RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
    definition from there looks like this:
    [O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]
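
A small sketch of the difference between the two queries, assuming RDKit is installed. It compares the generic "[OH]" pattern with the alcohol definition quoted above on acetic acid and on ethanol; ethanol is an extra test molecule chosen here for illustration and does not come from the thread.

```python
from rdkit import Chem

# Generic hydroxyl query versus the more specific alcohol definition.
generic_oh = Chem.MolFromSmarts("[OH]")
alcohol = Chem.MolFromSmarts("[O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]")

results = {}
for smiles in ("CC(=O)O", "CCO"):  # acetic acid, ethanol
    mol = Chem.MolFromSmiles(smiles)
    results[smiles] = (
        len(mol.GetSubstructMatches(generic_oh)),
        len(mol.GetSubstructMatches(alcohol)),
    )

# Each value is (number of [OH] matches, number of alcohol matches).
print(results)
```

The alcohol definition rejects the acid oxygen because its carbon neighbour carries a non-ring double bond to O, N, or S.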


    -greg


    On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 <yanyangh...@163.com
    <mailto:yanyangh...@163.com>> wrote:

        Hi, Chenyang,
        You don't need to delete the substructure from the molecule.
        Just check whether the mapped atoms have already been matched.
        For example:

        from rdkit import Chem

        m = Chem.MolFromSmiles('CC(=O)O')
        OH = Chem.MolFromSmarts('[OH]')
        COOH = Chem.MolFromSmarts('C(O)=O')

        m.GetSubstructMatches(OH)
        >>((3,),)
        m.GetSubstructMatches(COOH)
        >>((1, 3, 2),)

        Since atom "3" has already been matched, it should be ignored.
        So you can create a "set" to record the matched atoms and
        avoid counting them more than once.
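
A minimal sketch of that bookkeeping, using the two match tuples shown above for acetic acid. The function name `count_without_overlap` is hypothetical, and the groups are assumed to be listed with the more complex pattern first so that -COOH claims its atoms before "[OH]" is tried.

```python
def count_without_overlap(matches_by_group):
    """Count matches group by group, skipping any match that touches
    an atom already claimed by an earlier match."""
    claimed = set()
    counts = {}
    for name, matches in matches_by_group:
        counts[name] = 0
        for match in matches:
            if claimed.isdisjoint(match):
                claimed.update(match)
                counts[name] += 1
    return counts


# Tuples as returned by m.GetSubstructMatches(...) for 'CC(=O)O'.
matches_by_group = [
    ("COOH", ((1, 3, 2),)),
    ("OH", ((3,),)),
]

counts = count_without_overlap(matches_by_group)
print(counts)  # {'COOH': 1, 'OH': 0}
```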

        ------------------------------------------------------------------------
        Hongbin Yang 杨弘宾

            *From:* Chenyang Shi <mailto:cs3...@columbia.edu>
            *Date:* 2017-03-06 14:04
            *To:* Greg Landrum <mailto:greg.land...@gmail.com>
            *CC:* RDKit Discuss
            <mailto:rdkit-discuss@lists.sourceforge.net>
            *Subject:* Re: [Rdkit-discuss] delete a substructure
            Hi Greg,

            Thanks for the prompt reply. I did try
            "GetSubstructMatches()" and it returns the correct number
            of substructures for CH3COOH. The potential problem with
            this approach is that, as the molecule gets more
            complicated, it may produce duplicate counts for certain
            functional groups. For example, the --OH (alcohol) group
            will likely also be counted inside --COOH. A safer way, in
            my mind, is to remove each substructure once it has been
            counted.

            Greg, you mentioned the "chemical reaction functionality".
            Could you show me a demo script that uses it, with CH3COOH
            as an example? I will definitely delve into the manual to
            learn more, but reading your code would be a good start.

            Thanks,
            Chenyang


            On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum
            <greg.land...@gmail.com <mailto:greg.land...@gmail.com>>
            wrote:

                Hi Chenyang,

                If you're really interested in counting the number of
                times the substructure appears, you can do that much
                quicker with `GetSubstructMatches()`:

                In [1]: from rdkit import Chem
                In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
                In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
                Out[3]: 2

                Is that sufficient, or do you actually want to
                sequentially remove all of the groups in your list?

                If you actually want to remove them, you are probably
                better off using the chemical reaction functionality
                instead of DeleteSubstructs(), which recalculates the
                number of implicit Hs on atoms after each call.
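
A rough sketch of how the reaction route might look for CH3COOH, assuming RDKit is installed. The reaction SMARTS and the choice to leave a dummy atom (`*`) where the group was attached are illustrative assumptions, not a recipe taken from this thread. The dummy atom keeps the attachment bond in place, so the remaining carbon still has three hydrogens after sanitization.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)O")  # CH3COOH

# Unmapped reactant atoms (the COOH group) are removed; the mapped
# neighbour is kept and bonded to a dummy atom that marks the old
# attachment point.
rxn = AllChem.ReactionFromSmarts("[*:1][CX3](=O)[OX2H1]>>[*:1][*]")

products = rxn.RunReactants((mol,))
frag = products[0][0]
Chem.SanitizeMol(frag)  # reaction products are not sanitized automatically

# The methyl carbon still has 3 Hs and 4 connections, so it can be counted.
methyl = Chem.MolFromSmarts("[CH3;X4]")
n_methyl = len(frag.GetSubstructMatches(methyl))
print(Chem.MolToSmiles(frag), n_methyl)
```

For a molecule with several groups, each entry in `products` corresponds to one match of the reactant template, so removing all groups would need a loop that reapplies the reaction to its own product.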

                -greg


                On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi
                <cs3...@columbia.edu <mailto:cs3...@columbia.edu>> wrote:

                    I am new to RDKit, but I am already impressed by
                    its vibrant community. I have a question about
                    deleting a substructure. The RDKit documentation
                    has this snippet of code describing how to delete
                    a substructure:

                    >>> from rdkit import Chem
                    >>> from rdkit.Chem import AllChem
                    >>> m = Chem.MolFromSmiles("CC(=O)O")
                    >>> patt = Chem.MolFromSmarts("C(=O)[OH]")
                    >>> rm = AllChem.DeleteSubstructs(m, patt)
                    >>> Chem.MolToSmiles(rm)
                    'C'

                    This block of code first loads the molecule
                    CH3COOH from its SMILES string, then defines the
                    substructure to be deleted, COOH, as a SMARTS
                    pattern. After the final line, the program outputs
                    'C', in SMILES form.

                    I wanted to develop a method for counting the
                    number of groups in a molecule. In the CH3COOH
                    case, I can find the number of --CH3 and --COOH
                    groups by using their respective SMARTS patterns
                    with no problem. However, when the molecule
                    becomes more complicated, it is preferable to
                    delete each substructure once it has been counted
                    before moving on to the next SMARTS search. In the
                    current case, after searching for the -COOH group
                    and deleting it, the leftover is 'C', which is
                    essentially CH4 instead of --CH3, so I cannot
                    proceed with the SMARTS search for --CH3
                    ([CH3;A;X4!R]).

                    Is there any way to work around this?
                    Thanks,
                    Chenyang


                    
                    _______________________________________________
                    Rdkit-discuss mailing list
                    Rdkit-discuss@lists.sourceforge.net
                    <mailto:Rdkit-discuss@lists.sourceforge.net>
                    https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
                    <https://lists.sourceforge.net/lists/listinfo/rdkit-discuss>




        