Hi Greg, Thanks very much for looking into this. I agree that qmol_from_ctab should not sanitize the molecule - that makes sense to me!
Thanks, Susan On Fri, Dec 10, 2021 at 10:18 AM Greg Landrum <greg.land...@gmail.com> wrote: > Hi Susan, > > I haven't looked at the Sgroup or unsaturation flags yet (I will try to do > this later today), but a word on the aromaticity/kekulization. > > One of the things which is going wrong here at the cartridge level is that > qmol_from_ctab() is sanitizing the molecules it reads in. This is not > correct. qmol_from_ctab is intended to produce a query and queries should > not be sanitized. I will get that fixed and, assuming it doesn't turn up > other problems, we should be able to get that into the next patch release. > Here's that bug report: > https://github.com/rdkit/rdkit/issues/4787 > > There's another potential issue with how bonds from CTABs are parsed, > inspired by your first message, which we're discussing here: > https://github.com/rdkit/rdkit/issues/4785 > > Thanks for the very detailed descriptions of the problem! > -greg > > > > On Fri, Dec 10, 2021 at 11:11 AM Susan Leung <susanhle...@gmail.com> > wrote: > >> Hi Paolo, >> >> >> Thanks very much for filing the bug and for offering the Python >> preprocessing solution. >> >> >> I actually have a few more CTABs that are not valid. One of which raises >> a kekulisation error and like the previous non-ring aromaton atom problem, >> there is an alternative SMARTS query that can be written. I suspect that >> you might suggest some Python preprocessing for this and converting to >> SMARTS? >> >> >> The other two errors, I don’t think are to do with sanitization but >> possibly due to the way the CTAB is read. One has multiple atoms with the >> unsaturated flag turned on. The other has a SGROUP defined. >> >> Please see below, where I tried to summarize and I again attach a ipynb >> if it helps. >> >> >> Thanks very much, >> >> >> Susan >> >> >> Example 2: Kekulization >> >> I want a query CTAB to match the following two tautomers. >> >> sm1 = 'Cc1c[nH]nc1C' >> >> sm2 = 'Cc1cn[nH]c1C' >> >> >> >> I would like to use the following CTAB as the query but it is not valid , >> I’m guessing it’s because of kekulization, which is the error produced when >> doing MolFromMolBlock: >> >> ctab = """ >> >> ACCLDraw12082111532D >> >> >> >> 7 7 0 0 0 0 0 0 0 0999 V2000 >> >> 9.2840 -12.1344 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 10.2367 -11.4422 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 9.8729 -10.3217 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 8.6950 -10.3217 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 8.3309 -11.4422 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 7.1932 -11.7471 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 9.2840 -13.3122 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 2 1 4 0 0 0 0 >> >> 2 3 4 0 0 0 0 >> >> 3 4 4 0 0 0 0 >> >> 4 5 4 0 0 0 0 >> >> 5 1 4 0 0 0 0 >> >> 5 6 1 0 0 0 0 >> >> 7 1 1 0 0 0 0 >> >> M END >> >> """ >> >> select is_valid_ctab('{ctab}') >> >> >> >> Returns False >> >> I can make an alternative valid CTAB with a hydrogen on one of the >> nitrogens that is valid, but then it doesn’t match both sm1 and sm2. >> >> ctab_fixed = """ >> >> ACCLDraw12082111272D >> >> >> >> 8 8 0 0 0 0 0 0 0 0999 V2000 >> >> 11.6590 -11.9469 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 12.6117 -11.2547 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 12.2479 -10.1342 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 11.0700 -10.1342 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 10.7059 -11.2547 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 9.5682 -11.5596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 11.6590 -13.1247 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 12.8368 -9.1142 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 2 1 4 0 0 0 0 >> >> 2 3 4 0 0 0 0 >> >> 3 4 4 0 0 0 0 >> >> 4 5 4 0 0 0 0 >> >> 5 1 4 0 0 0 0 >> >> 5 6 1 0 0 0 0 >> >> 7 1 1 0 0 0 0 >> >> 3 8 1 0 0 0 0 >> >> M END >> >> """ >> >> >> >> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed} >> >> >> >> Returns True >> >> select mol_from_smiles('{sm2}') @> qmol_from_ctab('{ctab_fixed}') >> >> >> >> Returns False >> >> >> >> However, I can make a qmol from SMARTS that can match with both: >> >> alt_smarts = '[#6]1(:[#6]:[#7]:[#7]:[#6]:1-[#6])-[#6]' >> >> >> >> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}') >> >> >> >> select mol_from_smiles('{sm2}') @> qmol_from_smarts('{alt_smarts}' >> >> >> >> Return True for both. >> >> _______ >> >> Example 3 Unsaturated : >> >> How does RDKit handle the M UNS line? It can’t seem to handle this CTAB >> for example, where multiple atoms (atom 8 and atom 9) have the unsaturated >> flag on. >> >> ctab_og = """ >> >> ACCLDraw12082113482D >> >> >> >> 9 9 0 0 0 0 0 0 0 0999 V2000 >> >> 2.6030 -22.3675 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 3.6258 -21.7773 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 4.6447 -22.3669 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 4.6447 -23.5480 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 3.6283 -24.1374 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 2.6030 -23.5533 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 3.6258 -20.5962 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 2.6029 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 4.6486 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 2 1 2 0 0 0 0 >> >> 3 2 1 0 0 0 0 >> >> 4 3 2 0 0 0 0 >> >> 5 4 1 0 0 0 0 >> >> 1 6 1 0 0 0 0 >> >> 6 5 2 0 0 0 0 >> >> 2 7 1 0 0 0 0 >> >> 7 8 1 0 0 0 0 >> >> 7 9 1 0 0 0 0 >> >> M UNS 2 8 1 9 1 >> >> M END >> >> """ >> >> ctab_fixed = """ >> >> ACCLDraw12082113482D >> >> >> >> 9 9 0 0 0 0 0 0 0 0999 V2000 >> >> 2.6030 -22.3675 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 3.6258 -21.7773 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 4.6447 -22.3669 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 4.6447 -23.5480 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 3.6283 -24.1374 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 2.6030 -23.5533 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 3.6258 -20.5962 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0 >> >> 2.6029 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 4.6486 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 >> >> 2 1 2 0 0 0 0 >> >> 3 2 1 0 0 0 0 >> >> 4 3 2 0 0 0 0 >> >> 5 4 1 0 0 0 0 >> >> 1 6 1 0 0 0 0 >> >> 6 5 2 0 0 0 0 >> >> 2 7 1 0 0 0 0 >> >> 7 8 1 0 0 0 0 >> >> 7 9 1 0 0 0 0 >> >> M UNS 1 8 1 >> >> M END >> >> >> >> """ >> >> >> >> select is_valid_ctab('{ctab_og}') >> >> select is_valid_ctab('{ctab_fixed}') >> >> Returns False and True respectively >> >> MolFromMolBlock(ctab_og) gives warning Value 9 is not supported as an >> unsaturation query (only 0 and 1 are allowed). line: 23 >> >> *Example 4 – Sgroup* : >> >> I can’t work out why RDKit cannot accept this CTAB. I have other cases >> that have Sgroups so I know RDKit can handle them. If I switch the order of >> LABEL and CSTATE in the SGROUP block, it is valid. And if I remove the >> SGROUP block entirely it also reads fine. >> >> ctab_og = """ >> >> ACCLDraw12082117492D >> >> >> >> 0 0 0 0 0 999 V3000 >> >> M V30 BEGIN CTAB >> >> M V30 COUNTS 9 9 1 0 0 >> >> M V30 BEGIN ATOM >> >> M V30 1 C 27.8908 -28.8436 0 0 >> >> M V30 2 C 27.2981 -29.8716 0 0 >> >> M V30 3 C 29.072 -28.8436 0 0 >> >> M V30 4 C 27.8937 -30.8881 0 0 >> >> M V30 5 C 29.0691 -30.8881 0 0 >> >> M V30 6 C 29.6604 -29.8644 0 0 >> >> M V30 7 N 29.6604 -31.9103 0 0 CHG=1 >> >> M V30 8 O 29.0691 -32.934 0 0 >> >> M V30 9 O 30.8416 -31.9103 0 0 CHG=-1 >> >> M V30 END ATOM >> >> M V30 BEGIN BOND >> >> M V30 1 1 1 2 >> >> M V30 2 2 3 1 >> >> M V30 3 2 2 4 >> >> M V30 4 1 4 5 >> >> M V30 5 1 6 3 >> >> M V30 6 2 6 5 >> >> M V30 7 1 5 7 >> >> M V30 8 2 7 8 >> >> M V30 9 1 7 9 >> >> M V30 END BOND >> >> M V30 BEGIN SGROUP >> >> M V30 1 SUP 0 ATOMS=(3 7 8 9) XBONDS=(1 7) LABEL="NO2" CSTATE=(4 7 >> - >> >> M V30 -0.4100 0.7100 0.0000) >> >> M V30 END SGROUP >> >> M V30 END CTAB >> >> M END >> >> """ >> >> ctab_fixed = """ >> >> ACCLDraw12082117492D >> >> >> >> 0 0 0 0 0 999 V3000 >> >> M V30 BEGIN CTAB >> >> M V30 COUNTS 9 9 1 0 0 >> >> M V30 BEGIN ATOM >> >> M V30 1 C 27.8908 -28.8436 0 0 >> >> M V30 2 C 27.2981 -29.8716 0 0 >> >> M V30 3 C 29.072 -28.8436 0 0 >> >> M V30 4 C 27.8937 -30.8881 0 0 >> >> M V30 5 C 29.0691 -30.8881 0 0 >> >> M V30 6 C 29.6604 -29.8644 0 0 >> >> M V30 7 N 29.6604 -31.9103 0 0 CHG=1 >> >> M V30 8 O 29.0691 -32.934 0 0 >> >> M V30 9 O 30.8416 -31.9103 0 0 CHG=-1 >> >> M V30 END ATOM >> >> M V30 BEGIN BOND >> >> M V30 1 1 1 2 >> >> M V30 2 2 3 1 >> >> M V30 3 2 2 4 >> >> M V30 4 1 4 5 >> >> M V30 5 1 6 3 >> >> M V30 6 2 6 5 >> >> M V30 7 1 5 7 >> >> M V30 8 2 7 8 >> >> M V30 9 1 7 9 >> >> M V30 END BOND >> >> M V30 BEGIN SGROUP >> >> M V30 1 SUP 0 ATOMS=(3 7 8 9) XBONDS=(1 7) CSTATE=(4 7 - >> >> M V30 -0.4100 0.7100 0.0000) LABEL="NO2" >> >> M V30 END SGROUP >> >> M V30 END CTAB >> >> M END >> >> """ >> >> >> >> select is_valid_ctab('{ctab_og}') >> >> Returns False >> >> select is_valid_ctab('{ctab_fixed}') >> >> Returns True >> >> >> On Thu, Dec 9, 2021 at 10:06 PM Paolo Tosco <paolo.tosco.m...@gmail.com> >> wrote: >> >>> Hi Susan, >>> >>> that looks like a bug in the way the MDL query is parsed; I have filed >>> it here: >>> https://github.com/rdkit/rdkit/issues/4785 >>> >>> If you can afford doing some Python massaging to your CTAB queries and >>> converting them to SMARTS before submitting them to PostgreSQL when they >>> fail sanitization, the following should work: >>> >>> mol_og = Chem.MolFromMolBlock(ctab_og, sanitize=False) >>> try: >>> Chem.SanitizeMol(mol_og) >>> cur = conn.cursor() >>> cur.execute(f"""select mol_from_smiles('{sm1}') @> >>> qmol_from_ctab('{ctab_og}')""") >>> rows = cur.fetchall() >>> print(rows) >>> except Chem.AtomKekulizeException as e: >>> if re.match(r"non-ring atom \d+ marked aromatic", str(e)): >>> Chem.FastFindRings(mol_og) >>> rwmol_og = Chem.RWMol(mol_og) >>> for a in mol_og.GetAtoms(): >>> if a.GetIsAromatic() and not a.IsInRing(): >>> rwmol_og.ReplaceAtom(a.GetIdx(), >>> rdqueries.IsAromaticQueryAtom()) >>> try: >>> Chem.SanitizeMol(rwmol_og) >>> smarts_og = Chem.MolToSmarts(rwmol_og) >>> cur = conn.cursor() >>> cur.execute(f"""select mol_from_smiles('{sm1}') @> >>> mol_from_smarts('{smarts_og}')""") >>> rows = cur.fetchall() >>> print(rows) >>> except: >>> ... >>> >>> HTH, cheers >>> p. >>> >>> >>> On Thu, Dec 9, 2021 at 5:32 PM Susan Leung <susanhle...@gmail.com> >>> wrote: >>> >>>> Hi all, >>>> >>>> >>>> >>>> I am trying to do some substructure queries using the RDKit PostgreSQL >>>> cartridge. Specifically, my queries substructure inputs are CTAB (not >>>> SMARTS) so I would like to use qmol_from_ctab. However, I have some >>>> problems with making valid query molecules with a few CTABs. >>>> >>>> >>>> >>>> In this query, I try to use a CTAB to make a query to search for aryl >>>> boronate acid/ester. I can make an equivalent query using SMARTS but the >>>> CTAB is not valid. >>>> >>>> >>>> As far as I am aware, there's no warning message when using the SQL >>>> functions, so I use MolFromMolBlock from python and get "non-ring atom >>>> 0 marked aromatic" so I correct the aromatic bond type to double bond >>>> and the CTAB can be read in (but that's not the query I want). I am >>>> guessing that there may be additional validity checks / sanitization >>>> steps when executing qmol_from_ctab vs qmol_from_smarts? As far as I can >>>> see, there’s no flag in qmol_from_ctab. >>>> >>>> >>>> >>>> I describe the general problem below but also attach the ipynb (if it >>>> is useful) that uses psycopg2 to do the SQL , leaving out the database >>>> connection credentials. >>>> >>>> >>>> >>>> Many thanks, >>>> >>>> >>>> >>>> Susan >>>> >>>> __________________________________ >>>> >>>> >>>> >>>> For example, I want to match an aromatic boronic acid: >>>> >>>> sm1 = 'OB(O)c1ccccc1' >>>> >>>> >>>> >>>> But the following CTAB isn’t valid. MolFromFromBlock returns non-ring >>>> atom marked aromatic error so I suspect it’s to do with that. Also changing >>>> the bond marked aromatic ‘4’ to a double bond ‘2’ makes the ctab valid. >>>> >>>> ctab_og = """Boronate acid/ester(aryl) >>>> >>>> SciTegic12012112112D >>>> >>>> >>>> >>>> 5 4 0 0 0 0 999 V2000 >>>> >>>> 1.7243 -2.7324 0.0000 A 0 0 >>>> >>>> 2.7559 -2.1456 0.0000 C 0 0 >>>> >>>> 3.7808 -2.7324 0.0000 B 0 0 >>>> >>>> 4.8057 -2.1456 0.0000 O 0 0 >>>> >>>> 3.7808 -3.9190 0.0000 O 0 0 >>>> >>>> 1 2 4 0 0 1 0 >>>> >>>> 2 3 1 0 >>>> >>>> 3 4 1 0 >>>> >>>> 3 5 1 0 >>>> >>>> M END >>>> >>>> > <Name> >>>> >>>> Boronate acid/ester(aryl) >>>> >>>> >>>> >>>> """ >>>> >>>> ctab_fixed = """Boronate acid/ester(aryl) >>>> >>>> SciTegic12012112112D >>>> >>>> >>>> >>>> 5 4 0 0 0 0 999 V2000 >>>> >>>> 1.7243 -2.7324 0.0000 A 0 0 >>>> >>>> 2.7559 -2.1456 0.0000 C 0 0 >>>> >>>> 3.7808 -2.7324 0.0000 B 0 0 >>>> >>>> 4.8057 -2.1456 0.0000 O 0 0 >>>> >>>> 3.7808 -3.9190 0.0000 O 0 0 >>>> >>>> 1 2 2 0 0 1 0 >>>> >>>> 2 3 1 0 >>>> >>>> 3 4 1 0 >>>> >>>> 3 5 1 0 >>>> >>>> M END >>>> >>>> > <Name> >>>> >>>> Boronate acid/ester(aryl) >>>> >>>> >>>> >>>> """ >>>> >>>> select is_valid_ctab('{ctab_og}') >>>> >>>> >>>> >>>> Returns False >>>> >>>> select is_valid_ctab('{ctab_fixed}') >>>> >>>> Returns True >>>> >>>> >>>> >>>> However, I can make a qmol using SMARTS match sm1. Is there of making >>>> the query CTAB valid so we don’t have to use SMARTS? >>>> >>>> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}') >>>> >>>> >>>> >>>> Returns False >>>> >>>> >>>> >>>> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}') >>>> >>>> >>>> >>>> Returns True >>>> _______________________________________________ >>>> Rdkit-discuss mailing list >>>> Rdkit-discuss@lists.sourceforge.net >>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >>>> >>> _______________________________________________ >> Rdkit-discuss mailing list >> Rdkit-discuss@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> >
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss