[Apologies for resending: my previous reply did not go to the rdkit list. I have also added some more information.]
Thank you, James.

My mistake was to assume that, because RGroupDecompose() can tell that molecules in my set containing this literal substructure (unsubstituted N):

[cid:image004.png@01D95D5E.053F3E00]

match the original core I used:

[cid:image005.png@01D95D5E.053F3E00]

it would then stick to the core I specified for the R-group decomposition, rather than create its tautomer, with different R-group labels on it, to match the target molecule's pattern instead of the core pattern. But OK, I imagine there is a reason for this.

I tried specifying core_mol with R labels on it:

[cid:image006.png@01D95D5E.053F3E00]

--> then none of the molecules in the alternative tautomeric form match, even when the N is unsubstituted :/

In practice, I think I must convert all N-unsubstituted molecules to the tautomeric form I want before running RGroupDecompose(). I tried CanonicalTautomer(), but it does not do that consistently; in fact, it more often converts the desired tautomer (NH attached to the benzene ring) into the other one. I will probably need to do this via a reaction.

Thanks again for your input.

From: James Wallace <james.wall...@evotec.com>
Sent: 22 March 2023 14:20
To: Giovanni Tricarico <giovanni.tricar...@glpg.com>
Subject: RE: invalid core SMILES returned by RGroupDecompose

I've found some similar behaviour with respect to the tautomer, but when part of my query molecule is a bridged ring. In that case, instead of matching the structure as presented, it matches the bridged ring as a whole, as well as matching smaller rings represented by the bridge. Being able to force a 'complete' match, so to speak, would help here.
As for your core, I've seen this before: the aromaticity check seems to fail when [nH] is present in that kind of structure, which confuses the Kekulize/dekekulize code. All I could do to work around it was to build the molecule with the added option sanitize=False, so:

    mol = Chem.MolFromSmiles("[nH]1c2c([*:5])c([*:6])c([*:7])c([*:1])c2c([*:2])n1[*:3]", sanitize=False)

But that's not ideal.

From: Giovanni Tricarico <giovanni.tricar...@glpg.com>
Sent: 22 March 2023 10:04
To: Rdkit-discuss@lists.sourceforge.net
Subject: [Rdkit-discuss] invalid core SMILES returned by RGroupDecompose

Hello,

I tried out RGroupDecompose on a set of indazoles, using "c1ccc2[nH]ncc2c1" as the core molecule. Most of them gave a valid core SMILES:

    n1c([*:2])c2c([*:1])c([*:7])c([*:6])c([*:5])c2n1[*:4]

However, some gave this core SMILES:

    [nH]1c2c([*:5])c([*:6])c([*:7])c([*:1])c2c([*:2])n1[*:3]

which RDKit itself then refuses to convert to a molecule (other software, such as Dotmatics Vortex, does accept it (?)).

[cid:image007.png@01D95D5E.053F3E00]

Any idea what may be going wrong? I noticed that the tautomeric form of the indazole ring is different in the molecules that gave rise to the 'wrong' core: the H (or other substituent) is on the nitrogen atom that is not attached to the benzene ring.

[In fact, that also raises the question of why a tautomer of the original core was matched by RGroupDecompose, and how one could instead force matching of the chosen tautomer only.]
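
[Note: the parsing behaviour described in this thread can be checked with a few lines. This is only a minimal sketch, assuming RDKit is installed; try_parse is a hypothetical helper name, not an RDKit function. The likely reason for the failure is that, in the second core, both ring nitrogens carry a substituent (H on one, R3 on the other), so no Kekulé structure can be written for the five-membered ring.]

```python
from rdkit import Chem, RDLogger

# Silence the parse errors that RDKit logs for the second core.
RDLogger.DisableLog("rdApp.*")

# The two core SMILES quoted in the thread.
GOOD_CORE = "n1c([*:2])c2c([*:1])c([*:7])c([*:6])c([*:5])c2n1[*:4]"
BAD_CORE = "[nH]1c2c([*:5])c([*:6])c([*:7])c([*:1])c2c([*:2])n1[*:3]"


def try_parse(smiles, sanitize=True):
    """Hypothetical helper: return an RDKit Mol, or None if parsing fails."""
    return Chem.MolFromSmiles(smiles, sanitize=sanitize)


good = try_parse(GOOD_CORE)                    # sanitized parse
bad = try_parse(BAD_CORE)                      # sanitized parse
bad_raw = try_parse(BAD_CORE, sanitize=False)  # workaround from the thread

for label, mol in [("good", good), ("bad", bad), ("bad, unsanitized", bad_raw)]:
    print(label, "->", "parsed" if mol is not None else "rejected")
```

[A molecule built with sanitize=False has had no valence or aromaticity checks, so it is probably best treated as a container for the atom-map labels rather than as a chemically validated structure.]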
Thanks

Giovanni Tricarico
Principal Scientist Computational Chemistry
Galapagos
Generaal De Wittelaan L11 A3
2800 Mechelen
Belgium
T: +32 15 6514 30
www.glpg.com

This e-mail and its attachment(s) (if any) may contain confidential and/or proprietary information and is intended for its addressee(s) only. Any unauthorized use of the information contained herein (including, but not limited to, alteration, reproduction, communication, distribution or any other form of dissemination) is strictly prohibited. If you are not the intended addressee, please notify the originator promptly and delete this e-mail and its attachment(s) (if any) subsequently. Neither Galapagos nor any of its affiliates shall be liable for direct, special, indirect or consequential damages arising from alteration of the contents of this message (by a third party) or as a result of a virus being passed on.
Please find our information on data protection here <https://www.evotec.com/en/about/site-information/privacy-policy>.

Evotec (UK) Ltd is a limited company registered in England and Wales. Registration number: 2674265. Registered Office: 114 Innovation Drive, Milton Park, Abingdon, Oxfordshire, OX14 4RZ, United Kingdom

STATEMENT OF CONFIDENTIALITY. This email and any attachments may contain confidential, proprietary, privileged and/or private information. If received in error, please notify us immediately by reply email and then delete this email and any attachments from your system. Thank you.
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
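
[Note: one possible way to bring N-unsubstituted indazoles into a chosen tautomeric form before running RGroupDecompose(), as discussed at the top of this thread, is to enumerate tautomers and keep the one that matches the wanted pattern, instead of relying on the canonical tautomer. This is only a sketch under assumptions: it assumes RDKit's rdMolStandardize.TautomerEnumerator generates the 1H/2H indazole shift, and pick_tautomer and WANTED_1H are hypothetical names that are not part of RDKit.]

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Wanted form: 1H-indazole, i.e. the NH is the nitrogen attached to the
# benzene ring (assumed SMARTS, written for this example).
WANTED_1H = Chem.MolFromSmarts("[nH]1ncc2ccccc12")


def pick_tautomer(mol, wanted):
    """Hypothetical helper: return the first enumerated tautomer of `mol`
    that matches `wanted`, or `mol` unchanged if none does."""
    enumerator = rdMolStandardize.TautomerEnumerator()
    for taut in enumerator.Enumerate(mol):
        if taut.HasSubstructMatch(wanted):
            return taut
    return mol


# 2H-indazole: the H sits on the nitrogen NOT attached to the benzene ring.
two_h = Chem.MolFromSmiles("c1ccc2n[nH]cc2c1")
fixed = pick_tautomer(two_h, WANTED_1H)
print(Chem.MolToSmiles(two_h), "->", Chem.MolToSmiles(fixed))

# An N2-substituted indazole has no 1H tautomer, so it comes back unchanged.
n2_methyl = Chem.MolFromSmiles("Cn1cc2ccccc2n1")
print(Chem.MolToSmiles(pick_tautomer(n2_methyl, WANTED_1H)))
```

[N-substituted molecules are deliberately left as they are, since moving the substituent would change the compound; those would still need a core (or several cores) covering both substitution positions.]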