Hi Greg,

Thanks very much for looking into this. I agree that qmol_from_ctab should
not sanitize the molecule - that makes sense to me!

Thanks,

Susan

On Fri, Dec 10, 2021 at 10:18 AM Greg Landrum <greg.land...@gmail.com>
wrote:

> Hi Susan,
>
> I haven't looked at the Sgroup or unsaturation flags yet (I will try to do
> this later today), but a word on the aromaticity/kekulization.
>
> One of the things which is going wrong here at the cartridge level is that
> qmol_from_ctab() is sanitizing the molecules it reads in. This is not
> correct. qmol_from_ctab is intended to produce a query and queries should
> not be sanitized. I will get that fixed and, assuming it doesn't turn up
> other problems, we should be able to get that into the next patch release.
> Here's that bug report:
> https://github.com/rdkit/rdkit/issues/4787
>
> There's another potential issue with how bonds from CTABs are parsed,
> inspired by your first message, which we're discussing here:
> https://github.com/rdkit/rdkit/issues/4785
>
> Thanks for the very detailed descriptions of the problem!
> -greg
>
>
>
> On Fri, Dec 10, 2021 at 11:11 AM Susan Leung <susanhle...@gmail.com>
> wrote:
>
>> Hi Paolo,
>>
>>
>> Thanks very much for filing the bug and for offering the Python
>> preprocessing solution.
>>
>>
>> I actually have a few more CTABs that are not valid. One of which raises
>> a kekulisation error and like the previous non-ring aromaton atom problem,
>> there is an alternative SMARTS query that can be written. I suspect that
>> you might suggest some Python preprocessing for this and converting to
>> SMARTS?
>>
>>
>> The other two errors, I don’t think are to do with sanitization but
>> possibly due to the way the CTAB is read. One has multiple atoms with the
>> unsaturated flag turned on. The other has a SGROUP defined.
>>
>> Please see below, where I tried to summarize and I again attach a ipynb
>> if it helps.
>>
>>
>> Thanks very much,
>>
>>
>> Susan
>>
>>
>> Example 2: Kekulization
>>
>> I want a query CTAB to match the following two tautomers.
>>
>> sm1 = 'Cc1c[nH]nc1C'
>>
>> sm2 = 'Cc1cn[nH]c1C'
>>
>>
>>
>> I would like to use the following CTAB as the query but it is not valid ,
>> I’m guessing it’s because of kekulization, which is the error produced when
>> doing MolFromMolBlock:
>>
>> ctab = """
>>
>>   ACCLDraw12082111532D
>>
>>
>>
>>   7  7  0  0  0  0  0  0  0  0999 V2000
>>
>>     9.2840  -12.1344    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>    10.2367  -11.4422    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>     9.8729  -10.3217    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     8.6950  -10.3217    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     8.3309  -11.4422    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>     7.1932  -11.7471    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     9.2840  -13.3122    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>   2  1  4  0  0  0  0
>>
>>   2  3  4  0  0  0  0
>>
>>   3  4  4  0  0  0  0
>>
>>   4  5  4  0  0  0  0
>>
>>   5  1  4  0  0  0  0
>>
>>   5  6  1  0  0  0  0
>>
>>   7  1  1  0  0  0  0
>>
>> M  END
>>
>> """
>>
>> select is_valid_ctab('{ctab}')
>>
>>
>>
>> Returns False
>>
>> I can make an alternative valid CTAB with a hydrogen on one of the
>> nitrogens that is valid, but then it doesn’t match both sm1 and sm2.
>>
>> ctab_fixed = """
>>
>>   ACCLDraw12082111272D
>>
>>
>>
>>   8  8  0  0  0  0  0  0  0  0999 V2000
>>
>>    11.6590  -11.9469    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>    12.6117  -11.2547    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>    12.2479  -10.1342    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>    11.0700  -10.1342    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>    10.7059  -11.2547    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>     9.5682  -11.5596    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>    11.6590  -13.1247    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>    12.8368   -9.1142    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>   2  1  4  0  0  0  0
>>
>>   2  3  4  0  0  0  0
>>
>>   3  4  4  0  0  0  0
>>
>>   4  5  4  0  0  0  0
>>
>>   5  1  4  0  0  0  0
>>
>>   5  6  1  0  0  0  0
>>
>>   7  1  1  0  0  0  0
>>
>>   3  8  1  0  0  0  0
>>
>> M  END
>>
>> """
>>
>>
>>
>> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}
>>
>>
>>
>> Returns True
>>
>> select mol_from_smiles('{sm2}') @> qmol_from_ctab('{ctab_fixed}')
>>
>>
>>
>> Returns False
>>
>>
>>
>> However, I can make a qmol from SMARTS that can match with both:
>>
>> alt_smarts = '[#6]1(:[#6]:[#7]:[#7]:[#6]:1-[#6])-[#6]'
>>
>>
>>
>> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}')
>>
>>
>>
>>  select mol_from_smiles('{sm2}') @> qmol_from_smarts('{alt_smarts}'
>>
>>
>>
>> Return True for both.
>>
>> _______
>>
>> Example 3 Unsaturated :
>>
>> How does RDKit handle the M  UNS line? It can’t seem to handle this CTAB
>> for example, where multiple atoms (atom 8 and atom 9) have the unsaturated
>> flag on.
>>
>> ctab_og = """
>>
>>   ACCLDraw12082113482D
>>
>>
>>
>>   9  9  0  0  0  0  0  0  0  0999 V2000
>>
>>     2.6030  -22.3675    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     3.6258  -21.7773    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     4.6447  -22.3669    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     4.6447  -23.5480    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     3.6283  -24.1374    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     2.6030  -23.5533    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     3.6258  -20.5962    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>     2.6029  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     4.6486  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>   2  1  2  0  0  0  0
>>
>>   3  2  1  0  0  0  0
>>
>>   4  3  2  0  0  0  0
>>
>>   5  4  1  0  0  0  0
>>
>>   1  6  1  0  0  0  0
>>
>>   6  5  2  0  0  0  0
>>
>>   2  7  1  0  0  0  0
>>
>>   7  8  1  0  0  0  0
>>
>>   7  9  1  0  0  0  0
>>
>> M  UNS  2   8   1   9   1
>>
>> M  END
>>
>> """
>>
>> ctab_fixed = """
>>
>>   ACCLDraw12082113482D
>>
>>
>>
>>   9  9  0  0  0  0  0  0  0  0999 V2000
>>
>>     2.6030  -22.3675    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     3.6258  -21.7773    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     4.6447  -22.3669    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     4.6447  -23.5480    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     3.6283  -24.1374    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     2.6030  -23.5533    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     3.6258  -20.5962    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
>>
>>     2.6029  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>     4.6486  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>   2  1  2  0  0  0  0
>>
>>   3  2  1  0  0  0  0
>>
>>   4  3  2  0  0  0  0
>>
>>   5  4  1  0  0  0  0
>>
>>   1  6  1  0  0  0  0
>>
>>   6  5  2  0  0  0  0
>>
>>   2  7  1  0  0  0  0
>>
>>   7  8  1  0  0  0  0
>>
>>   7  9  1  0  0  0  0
>>
>> M  UNS  1   8   1
>>
>> M  END
>>
>>
>>
>> """
>>
>>
>>
>> select is_valid_ctab('{ctab_og}')
>>
>> select is_valid_ctab('{ctab_fixed}')
>>
>> Returns False and True respectively
>>
>> MolFromMolBlock(ctab_og) gives warning Value 9 is not supported as an
>> unsaturation query (only 0 and 1 are allowed). line: 23
>>
>> *Example 4 – Sgroup* :
>>
>> I can’t work out why RDKit cannot accept this CTAB. I have other cases
>> that have Sgroups so I know RDKit can handle them. If I switch the order of
>> LABEL and CSTATE in the SGROUP block, it is valid. And if I remove the
>> SGROUP block entirely it also reads fine.
>>
>> ctab_og = """
>>
>>   ACCLDraw12082117492D
>>
>>
>>
>>   0  0  0     0  0            999 V3000
>>
>> M  V30 BEGIN CTAB
>>
>> M  V30 COUNTS 9 9 1 0 0
>>
>> M  V30 BEGIN ATOM
>>
>> M  V30 1 C 27.8908 -28.8436 0 0
>>
>> M  V30 2 C 27.2981 -29.8716 0 0
>>
>> M  V30 3 C 29.072 -28.8436 0 0
>>
>> M  V30 4 C 27.8937 -30.8881 0 0
>>
>> M  V30 5 C 29.0691 -30.8881 0 0
>>
>> M  V30 6 C 29.6604 -29.8644 0 0
>>
>> M  V30 7 N 29.6604 -31.9103 0 0 CHG=1
>>
>> M  V30 8 O 29.0691 -32.934 0 0
>>
>> M  V30 9 O 30.8416 -31.9103 0 0 CHG=-1
>>
>> M  V30 END ATOM
>>
>> M  V30 BEGIN BOND
>>
>> M  V30 1 1 1 2
>>
>> M  V30 2 2 3 1
>>
>> M  V30 3 2 2 4
>>
>> M  V30 4 1 4 5
>>
>> M  V30 5 1 6 3
>>
>> M  V30 6 2 6 5
>>
>> M  V30 7 1 5 7
>>
>> M  V30 8 2 7 8
>>
>> M  V30 9 1 7 9
>>
>> M  V30 END BOND
>>
>> M  V30 BEGIN SGROUP
>>
>> M  V30 1 SUP 0  ATOMS=(3 7 8 9) XBONDS=(1 7) LABEL="NO2" CSTATE=(4 7
>>    -
>>
>> M  V30  -0.4100   0.7100   0.0000)
>>
>> M  V30 END SGROUP
>>
>> M  V30 END CTAB
>>
>> M  END
>>
>> """
>>
>> ctab_fixed = """
>>
>>   ACCLDraw12082117492D
>>
>>
>>
>>   0  0  0     0  0            999 V3000
>>
>> M  V30 BEGIN CTAB
>>
>> M  V30 COUNTS 9 9 1 0 0
>>
>> M  V30 BEGIN ATOM
>>
>> M  V30 1 C 27.8908 -28.8436 0 0
>>
>> M  V30 2 C 27.2981 -29.8716 0 0
>>
>> M  V30 3 C 29.072 -28.8436 0 0
>>
>> M  V30 4 C 27.8937 -30.8881 0 0
>>
>> M  V30 5 C 29.0691 -30.8881 0 0
>>
>> M  V30 6 C 29.6604 -29.8644 0 0
>>
>> M  V30 7 N 29.6604 -31.9103 0 0 CHG=1
>>
>> M  V30 8 O 29.0691 -32.934 0 0
>>
>> M  V30 9 O 30.8416 -31.9103 0 0 CHG=-1
>>
>> M  V30 END ATOM
>>
>> M  V30 BEGIN BOND
>>
>> M  V30 1 1 1 2
>>
>> M  V30 2 2 3 1
>>
>> M  V30 3 2 2 4
>>
>> M  V30 4 1 4 5
>>
>> M  V30 5 1 6 3
>>
>> M  V30 6 2 6 5
>>
>> M  V30 7 1 5 7
>>
>> M  V30 8 2 7 8
>>
>> M  V30 9 1 7 9
>>
>> M  V30 END BOND
>>
>> M  V30 BEGIN SGROUP
>>
>> M  V30 1 SUP 0  ATOMS=(3 7 8 9) XBONDS=(1 7) CSTATE=(4 7        -
>>
>> M  V30  -0.4100   0.7100   0.0000) LABEL="NO2"
>>
>> M  V30 END SGROUP
>>
>> M  V30 END CTAB
>>
>> M  END
>>
>> """
>>
>>
>>
>> select is_valid_ctab('{ctab_og}')
>>
>> Returns False
>>
>> select is_valid_ctab('{ctab_fixed}')
>>
>> Returns True
>>
>>
>> On Thu, Dec 9, 2021 at 10:06 PM Paolo Tosco <paolo.tosco.m...@gmail.com>
>> wrote:
>>
>>> Hi Susan,
>>>
>>> that looks like a bug in the way the MDL query is parsed; I have filed
>>> it here:
>>> https://github.com/rdkit/rdkit/issues/4785
>>>
>>> If you can afford doing some Python massaging to your CTAB queries and
>>> converting them to SMARTS before submitting them to PostgreSQL when they
>>> fail sanitization, the following should work:
>>>
>>> mol_og = Chem.MolFromMolBlock(ctab_og, sanitize=False)
>>> try:
>>>     Chem.SanitizeMol(mol_og)
>>>     cur = conn.cursor()
>>>     cur.execute(f"""select mol_from_smiles('{sm1}') @>
>>> qmol_from_ctab('{ctab_og}')""")
>>>     rows = cur.fetchall()
>>>     print(rows)
>>> except Chem.AtomKekulizeException as e:
>>>     if re.match(r"non-ring atom \d+ marked aromatic", str(e)):
>>>         Chem.FastFindRings(mol_og)
>>>         rwmol_og = Chem.RWMol(mol_og)
>>>         for a in mol_og.GetAtoms():
>>>             if a.GetIsAromatic() and not a.IsInRing():
>>>                 rwmol_og.ReplaceAtom(a.GetIdx(),
>>> rdqueries.IsAromaticQueryAtom())
>>>         try:
>>>             Chem.SanitizeMol(rwmol_og)
>>>             smarts_og = Chem.MolToSmarts(rwmol_og)
>>>             cur = conn.cursor()
>>>             cur.execute(f"""select mol_from_smiles('{sm1}') @>
>>> mol_from_smarts('{smarts_og}')""")
>>>             rows = cur.fetchall()
>>>             print(rows)
>>>         except:
>>>             ...
>>>
>>> HTH, cheers
>>> p.
>>>
>>>
>>> On Thu, Dec 9, 2021 at 5:32 PM Susan Leung <susanhle...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>>
>>>>
>>>> I am trying to do some substructure queries using the RDKit PostgreSQL
>>>> cartridge. Specifically, my queries substructure inputs are CTAB (not
>>>> SMARTS) so I would like to use qmol_from_ctab. However, I have some
>>>> problems with making valid query molecules with a few CTABs.
>>>>
>>>>
>>>>
>>>> In this query, I try to use a CTAB to make a query to search for aryl
>>>> boronate acid/ester. I can make an equivalent query using SMARTS but the
>>>> CTAB is not valid.
>>>>
>>>>
>>>> As far as I am aware, there's no warning message when using the SQL
>>>> functions, so I use MolFromMolBlock from python and get "non-ring atom
>>>> 0 marked aromatic" so I correct the aromatic bond type to double bond
>>>> and the CTAB can be read in (but that's not the query I want). I am
>>>> guessing that there may be additional validity checks / sanitization
>>>> steps when executing qmol_from_ctab vs qmol_from_smarts? As far as I can
>>>> see, there’s no flag in qmol_from_ctab.
>>>>
>>>>
>>>>
>>>> I describe the general problem below but also attach the ipynb (if it
>>>> is useful) that uses psycopg2 to do the SQL , leaving out the database
>>>> connection credentials.
>>>>
>>>>
>>>>
>>>> Many thanks,
>>>>
>>>>
>>>>
>>>> Susan
>>>>
>>>> __________________________________
>>>>
>>>>
>>>>
>>>> For example, I want to match an aromatic boronic acid:
>>>>
>>>> sm1 = 'OB(O)c1ccccc1'
>>>>
>>>>
>>>>
>>>> But the following CTAB isn’t valid. MolFromFromBlock returns non-ring
>>>> atom marked aromatic error so I suspect it’s to do with that. Also changing
>>>> the bond marked aromatic ‘4’ to a double bond ‘2’ makes the ctab valid.
>>>>
>>>> ctab_og = """Boronate acid/ester(aryl)
>>>>
>>>>   SciTegic12012112112D
>>>>
>>>>
>>>>
>>>>   5  4  0  0  0  0            999 V2000
>>>>
>>>>     1.7243   -2.7324    0.0000 A   0  0
>>>>
>>>>     2.7559   -2.1456    0.0000 C   0  0
>>>>
>>>>     3.7808   -2.7324    0.0000 B   0  0
>>>>
>>>>     4.8057   -2.1456    0.0000 O   0  0
>>>>
>>>>     3.7808   -3.9190    0.0000 O   0  0
>>>>
>>>>   1  2  4  0  0  1  0
>>>>
>>>>   2  3  1  0
>>>>
>>>>   3  4  1  0
>>>>
>>>>   3  5  1  0
>>>>
>>>> M  END
>>>>
>>>> > <Name>
>>>>
>>>> Boronate acid/ester(aryl)
>>>>
>>>>
>>>>
>>>> """
>>>>
>>>> ctab_fixed = """Boronate acid/ester(aryl)
>>>>
>>>>   SciTegic12012112112D
>>>>
>>>>
>>>>
>>>>   5  4  0  0  0  0            999 V2000
>>>>
>>>>     1.7243   -2.7324    0.0000 A   0  0
>>>>
>>>>     2.7559   -2.1456    0.0000 C   0  0
>>>>
>>>>     3.7808   -2.7324    0.0000 B   0  0
>>>>
>>>>     4.8057   -2.1456    0.0000 O   0  0
>>>>
>>>>     3.7808   -3.9190    0.0000 O   0  0
>>>>
>>>>   1  2  2  0  0  1  0
>>>>
>>>>   2  3  1  0
>>>>
>>>>   3  4  1  0
>>>>
>>>>   3  5  1  0
>>>>
>>>> M  END
>>>>
>>>> > <Name>
>>>>
>>>> Boronate acid/ester(aryl)
>>>>
>>>>
>>>>
>>>> """
>>>>
>>>> select is_valid_ctab('{ctab_og}')
>>>>
>>>>
>>>>
>>>> Returns False
>>>>
>>>> select is_valid_ctab('{ctab_fixed}')
>>>>
>>>> Returns True
>>>>
>>>>
>>>>
>>>> However, I can make a qmol using SMARTS match sm1. Is there of making
>>>> the query CTAB valid so we don’t have to use SMARTS?
>>>>
>>>> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}')
>>>>
>>>>
>>>>
>>>> Returns False
>>>>
>>>>
>>>>
>>>> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}')
>>>>
>>>>
>>>>
>>>> Returns True
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>
>>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to