Hi Ansgar,

This might be a trivial suggestion, but have you already considered using InChI? It is supposed to be invariant to (most) forms of tautomerism (https://www.inchi-trust.org/technical-faq-2/#6), so tautomers should have the same InChI code/key.

After that you would probably need only a few very specific tautomerisation rules, which could save you a lot of implementation time. Of course it depends on your use case, but for detecting tautomers it might help.
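
A minimal sketch of that idea, assuming an RDKit build with InChI support: group SMILES by InChIKey so that structures InChI treats as the same compound end up together. The helper names `inchikey_from_smiles` and `group_by_key` are made up for this illustration and are not part of RDKit.

```python
from collections import defaultdict


def inchikey_from_smiles(smiles):
    """Return the standard InChIKey for a SMILES string.

    Assumes RDKit was built with InChI support; the import is done
    lazily so the grouping helper below works without RDKit too.
    """
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.InchiToInchiKey(Chem.MolToInchi(mol))


def group_by_key(smiles_list, key_func=inchikey_from_smiles):
    """Group SMILES strings that share the same key (InChIKey by default)."""
    groups = defaultdict(list)
    for smiles in smiles_list:
        groups[key_func(smiles)].append(smiles)
    return dict(groups)
```

Calling `group_by_key(["Oc1ccccn1", "O=c1cccc[nH]1"])` (2-hydroxypyridine and 2-pyridone) should, as far as I know, give a single group. Standard InChI does not cover every kind of tautomerism, though; keto/enol pairs in particular may still get different keys, so it is worth checking on your own compounds.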

Best,

Jennifer

On 24.07.19 16:37, Greg Landrum wrote:
Hi Ansgar,

On Wed, Jul 24, 2019 at 9:33 AM Schuffenhauer, Ansgar <ansgar.schuffenha...@novartis.com> wrote:


    Very useful. Thank you. I'll be able to deal with the SMARTS part
    myself.


Great.

    But your answer has left me with another series of questions:

    You are saying that I am still using the old MolVS code for
    tautomer processing. That explains why the execution speed is not
    yet fully at the level I have come to expect from RDKit (no
    complaint meant; you have simply set the bar quite high with RDKit
    in general).


You say such nice things. :-)
I haven't spent enough time looking at the tautomer enumeration code to know whether or not the poor performance you are seeing is something inherent in the process or an implementation artifact that can be fixed.

    Are you saying that, with respect to tautomer standardization,
    there is actually no C++ port from Google Summer of Code? Or is
    there something that is not quite ready? Or is there tautomer code
    in the C++ and I am just not using it? In that case, what would be
    the right function to use? What is your recommendation when it
    comes to tautomer standardization?


Susan (who did the MolVS port) got a first version of the tautomer enumeration done but did not finish the scoring code that is necessary to get a "canonical" tautomer. Since the tautomer code in general wasn't finished, we didn't do a Python wrapper for any of it.

    For the sake of clarity: I am looking for a tautomer standardizer
    that produces a uniform, canonical tautomer. I am not asking for
    the physico-chemically correct one, that is, the lowest-energy
    one, as I assume no simple rule-based tautomer standardizer can
    perform that task.


At the moment the only real option is to stick with the Python-based code.
Completing the tautomer enumeration/canonicalization work is something that is on my ToDo list, but it hasn't managed to bubble up to the top. Having people in the community[1] requesting it helps raise the priority.

-greg
[1] Particularly people in the community who also happen to work for companies that have RDKit support contracts.

    Is the C++ port everything that is in
    rdkit.Chem.MolStandardize.rdMolStandardize?

    Best regards

    Ansgar

    *Ansgar Schuffenhauer*

    Senior Investigator I

    T +41 79 608 9063

    ansgar.schuffenha...@novartis.com

    *Novartis Pharma AG*

    NIBR

    *From:* Greg Landrum <greg.land...@gmail.com>
    *Sent:* Tuesday, 23 July 2019 14:43
    *To:* Schuffenhauer, Ansgar <ansgar.schuffenha...@novartis.com>
    *Cc:* rdkit-discuss@lists.sourceforge.net
    *Subject:* Re: rdkit.MolStandardize tautomer

    Hi Ansgar,

    This is still using the MolVS tautomer-handling code since we
    didn't finish the canonicalization part during last year's Google
    Summer of Code.[1]

    That means it's not using the parameter file that you found. The
    rules that are used are here:

    
    https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/MolStandardize/tautomer.py

    You can change those at runtime, but you need to be careful to
    properly re-import modules after doing so. Here's an example
    showing how to do that:

    https://gist.github.com/greglandrum/4ac2b4e7f8c61e25836e106467aef150

    I'm not going to claim that the SMARTS I constructed to change the
    1,3 (thio)keto/enol rule is the right one, but it does at least
    show how to make the changes and reload the standardize module so
    that they take effect.

    I hope this helps,

    -greg

    [1] and I haven't made it a priority because I dread the "no,
    that's not the right canonical tautomer" arguments that will ensue

    On Tue, Jul 23, 2019 at 8:54 AM Schuffenhauer, Ansgar
    <ansgar.schuffenha...@novartis.com> wrote:

        Hi Greg

        Thanks for your quick answer. What I am doing is essentially
        the following:

        from rdkit.Chem import MolStandardize

        my_standardizer = MolStandardize.standardize.Standardizer()

        standard_tautomer = my_standardizer.tautomer_parent(input_mol)

        I assume that at the stage where I construct my_standardizer
        there would be some opportunity to slip in alternative
        configuration info.

        By the way, I also think that one of the two cases of
        vanishing stereochemistry reported in
        https://github.com/rdkit/rdkit/issues/2363
        is caused by an overly eager keto/enol tautomerizer.

        Best regards

        Ansgar

        *Ansgar Schuffenhauer*

        Senior Investigator I

        T +41 79 608 9063

        ansgar.schuffenha...@novartis.com

        *Novartis Pharma AG*

        NIBR

        *From:* Greg Landrum <greg.land...@gmail.com>
        *Sent:* Monday, 22 July 2019 17:42
        *To:* Schuffenhauer, Ansgar <ansgar.schuffenha...@novartis.com>
        *Cc:* rdkit-discuss@lists.sourceforge.net
        *Subject:* Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 141,
        Issue 16

        Hi Ansgar,

        It is possible to specify the tautomer parameter file that is
        used, but in order for me to explain how, I need to know how
        you are currently using the code to enumerate tautomers (i.e.
        which function you are calling).

        As for the format: it's tab-delimited and the first entry is
        the name. The "r/f" flag indicates which direction the
        transform goes in; it is only there to make the name unique.

        In the SMARTS the first atom is the one with the mobile H and
        the last atom is where it should be moved to.
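
A rough sketch of reading a file in that format, using only the Python standard library. This is not the RDKit loader: the `TautomerTransform` tuple and `parse_transforms` function are hypothetical names, and treating lines starting with "//" as comments is an assumption based on the header line of the default file. The optional third and fourth columns follow the "Bonds" and "Charges" headings in that header.

```python
from typing import List, NamedTuple, Optional


class TautomerTransform(NamedTuple):
    name: str                      # e.g. "1,3 (thio)keto/enol f"; "f"/"r" keeps it unique
    smarts: str                    # first atom carries the mobile H, last atom receives it
    bonds: Optional[str] = None    # optional third column
    charges: Optional[str] = None  # optional fourth column


def parse_transforms(text: str) -> List[TautomerTransform]:
    """Parse tab-delimited tautomer transform definitions."""
    transforms = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and (assumed) "//" comment lines.
        if not line or line.startswith("//"):
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 2:
            raise ValueError(f"expected at least name and SMARTS: {raw!r}")
        bonds = fields[2] if len(fields) > 2 and fields[2] else None
        charges = fields[3] if len(fields) > 3 and fields[3] else None
        transforms.append(TautomerTransform(fields[0], fields[1], bonds, charges))
    return transforms


SAMPLE = (
    "//\tName\tSMARTS\tBonds\tCharges\n"
    "1,3 (thio)keto/enol f\t[CX4!H0]-[C]=[O,S,Se,Te;X1]\n"
    "1,3 (thio)keto/enol r\t[O,S,Se,Te;X2!H0]-[C]=[C]\n"
)

for transform in parse_transforms(SAMPLE):
    print(transform.name, "->", transform.smarts)
```

Keeping the direction flag inside the name, rather than splitting it off, mirrors the description above: it only serves to make each name unique.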

        -greg

        On Mon, Jul 22, 2019 at 3:08 PM Schuffenhauer, Ansgar
        <ansgar.schuffenha...@novartis.com> wrote:

            Dear all

            For the standardizer module (Chem.MolStandardize), what is
            the best way to change some of the tautomerizer rules?
            There is a data file in
            share/RDKit/Data/Molstandardize/tautomerTransforms.in
            which I assume defines the defaults.

            //      Name    SMARTS  Bonds  Charges
            1,3 (thio)keto/enol f  [CX4!H0]-[C]=[O,S,Se,Te;X1]
            1,3 (thio)keto/enol r  [O,S,Se,Te;X2!H0]-[C]=[C]
            1,5 (thio)keto/enol f  [CX4,NX3;!H0]-[C]=[C][CH0]=[O,S,Se,Te;X1]
            1,5 (thio)keto/enol r  [O,S,Se,Te;X2!H0]-[CH0]=[C]-[C]=[C,N]
            ...

            Now my questions are:
            1. What is the syntax of this file? What do the "f" and
            the "r" stand for? Do the SMARTS have to start with the
            atom carrying the mobile H?
            2. How can I instruct RDKit to use a file supplied by the
            user instead of this default file?

            The background for this question is that the SMARTS for
            keto/enol seem to be a bit too generic, as they also catch
            the alpha C-atoms of carboxylic acids and amides.
            Generating tautomers here leads to an epimerization of
            stereocenters in alpha positions of carboxylic acids and
            amides. That appears odd to me, as such stereocenters are
            quite stable (in contrast to those of "real" ketones and
            aldehydes).


            Best regards

            Ansgar

            Ansgar Schuffenhauer
            Senior Investigator I
            T +41 79 608 9063
            ansgar.schuffenha...@novartis.com

            Novartis Pharma AG
            NIBR



            _______________________________________________
            Rdkit-discuss mailing list
            Rdkit-discuss@lists.sourceforge.net
            https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss