Hi Susan,

I haven't looked at the Sgroup or unsaturation flags yet (I will try to do
this later today), but a word on the aromaticity/kekulization.

One of the things which is going wrong here at the cartridge level is that
qmol_from_ctab() is sanitizing the molecules it reads in. This is not
correct. qmol_from_ctab is intended to produce a query and queries should
not be sanitized. I will get that fixed and, assuming it doesn't turn up
other problems, we should be able to get that into the next patch release.
Here's that bug report:
https://github.com/rdkit/rdkit/issues/4787

There's another potential issue with how bonds from CTABs are parsed,
inspired by your first message, which we're discussing here:
https://github.com/rdkit/rdkit/issues/4785

Thanks for the very detailed descriptions of the problem!
-greg



On Fri, Dec 10, 2021 at 11:11 AM Susan Leung <susanhle...@gmail.com> wrote:

> Hi Paolo,
>
>
> Thanks very much for filing the bug and for offering the Python
> preprocessing solution.
>
>
> I actually have a few more CTABs that are not valid. One of which raises a
> kekulisation error and like the previous non-ring aromaton atom problem,
> there is an alternative SMARTS query that can be written. I suspect that
> you might suggest some Python preprocessing for this and converting to
> SMARTS?
>
>
> The other two errors, I don’t think are to do with sanitization but
> possibly due to the way the CTAB is read. One has multiple atoms with the
> unsaturated flag turned on. The other has a SGROUP defined.
>
> Please see below, where I tried to summarize and I again attach a ipynb if
> it helps.
>
>
> Thanks very much,
>
>
> Susan
>
>
> Example 2: Kekulization
>
> I want a query CTAB to match the following two tautomers.
>
> sm1 = 'Cc1c[nH]nc1C'
>
> sm2 = 'Cc1cn[nH]c1C'
>
>
>
> I would like to use the following CTAB as the query but it is not valid ,
> I’m guessing it’s because of kekulization, which is the error produced when
> doing MolFromMolBlock:
>
> ctab = """
>
>   ACCLDraw12082111532D
>
>
>
>   7  7  0  0  0  0  0  0  0  0999 V2000
>
>     9.2840  -12.1344    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>
>    10.2367  -11.4422    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>
>     9.8729  -10.3217    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>
>     8.6950  -10.3217    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>
>     8.3309  -11.4422    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>
>     7.1932  -11.7471    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     9.2840  -13.3122    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>   2  1  4  0  0  0  0
>
>   2  3  4  0  0  0  0
>
>   3  4  4  0  0  0  0
>
>   4  5  4  0  0  0  0
>
>   5  1  4  0  0  0  0
>
>   5  6  1  0  0  0  0
>
>   7  1  1  0  0  0  0
>
> M  END
>
> """
>
> select is_valid_ctab('{ctab}')
>
>
>
> Returns False
>
> I can make an alternative valid CTAB with a hydrogen on one of the
> nitrogens that is valid, but then it doesn’t match both sm1 and sm2.
>
> ctab_fixed = """
>
>   ACCLDraw12082111272D
>
>
>
>   8  8  0  0  0  0  0  0  0  0999 V2000
>
>    11.6590  -11.9469    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>
>    12.6117  -11.2547    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>
>    12.2479  -10.1342    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
>
>    11.0700  -10.1342    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>
>    10.7059  -11.2547    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>
>     9.5682  -11.5596    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>    11.6590  -13.1247    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>    12.8368   -9.1142    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
>
>   2  1  4  0  0  0  0
>
>   2  3  4  0  0  0  0
>
>   3  4  4  0  0  0  0
>
>   4  5  4  0  0  0  0
>
>   5  1  4  0  0  0  0
>
>   5  6  1  0  0  0  0
>
>   7  1  1  0  0  0  0
>
>   3  8  1  0  0  0  0
>
> M  END
>
> """
>
>
>
> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}
>
>
>
> Returns True
>
> select mol_from_smiles('{sm2}') @> qmol_from_ctab('{ctab_fixed}')
>
>
>
> Returns False
>
>
>
> However, I can make a qmol from SMARTS that can match with both:
>
> alt_smarts = '[#6]1(:[#6]:[#7]:[#7]:[#6]:1-[#6])-[#6]'
>
>
>
> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}')
>
>
>
>  select mol_from_smiles('{sm2}') @> qmol_from_smarts('{alt_smarts}'
>
>
>
> Return True for both.
>
> _______
>
> Example 3 Unsaturated :
>
> How does RDKit handle the M  UNS line? It can’t seem to handle this CTAB
> for example, where multiple atoms (atom 8 and atom 9) have the unsaturated
> flag on.
>
> ctab_og = """
>
>   ACCLDraw12082113482D
>
>
>
>   9  9  0  0  0  0  0  0  0  0999 V2000
>
>     2.6030  -22.3675    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     3.6258  -21.7773    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     4.6447  -22.3669    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     4.6447  -23.5480    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     3.6283  -24.1374    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     2.6030  -23.5533    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     3.6258  -20.5962    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
>
>     2.6029  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     4.6486  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>   2  1  2  0  0  0  0
>
>   3  2  1  0  0  0  0
>
>   4  3  2  0  0  0  0
>
>   5  4  1  0  0  0  0
>
>   1  6  1  0  0  0  0
>
>   6  5  2  0  0  0  0
>
>   2  7  1  0  0  0  0
>
>   7  8  1  0  0  0  0
>
>   7  9  1  0  0  0  0
>
> M  UNS  2   8   1   9   1
>
> M  END
>
> """
>
> ctab_fixed = """
>
>   ACCLDraw12082113482D
>
>
>
>   9  9  0  0  0  0  0  0  0  0999 V2000
>
>     2.6030  -22.3675    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     3.6258  -21.7773    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     4.6447  -22.3669    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     4.6447  -23.5480    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     3.6283  -24.1374    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     2.6030  -23.5533    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     3.6258  -20.5962    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
>
>     2.6029  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>     4.6486  -20.0057    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
>   2  1  2  0  0  0  0
>
>   3  2  1  0  0  0  0
>
>   4  3  2  0  0  0  0
>
>   5  4  1  0  0  0  0
>
>   1  6  1  0  0  0  0
>
>   6  5  2  0  0  0  0
>
>   2  7  1  0  0  0  0
>
>   7  8  1  0  0  0  0
>
>   7  9  1  0  0  0  0
>
> M  UNS  1   8   1
>
> M  END
>
>
>
> """
>
>
>
> select is_valid_ctab('{ctab_og}')
>
> select is_valid_ctab('{ctab_fixed}')
>
> Returns False and True respectively
>
> MolFromMolBlock(ctab_og) gives warning Value 9 is not supported as an
> unsaturation query (only 0 and 1 are allowed). line: 23
>
> *Example 4 – Sgroup* :
>
> I can’t work out why RDKit cannot accept this CTAB. I have other cases
> that have Sgroups so I know RDKit can handle them. If I switch the order of
> LABEL and CSTATE in the SGROUP block, it is valid. And if I remove the
> SGROUP block entirely it also reads fine.
>
> ctab_og = """
>
>   ACCLDraw12082117492D
>
>
>
>   0  0  0     0  0            999 V3000
>
> M  V30 BEGIN CTAB
>
> M  V30 COUNTS 9 9 1 0 0
>
> M  V30 BEGIN ATOM
>
> M  V30 1 C 27.8908 -28.8436 0 0
>
> M  V30 2 C 27.2981 -29.8716 0 0
>
> M  V30 3 C 29.072 -28.8436 0 0
>
> M  V30 4 C 27.8937 -30.8881 0 0
>
> M  V30 5 C 29.0691 -30.8881 0 0
>
> M  V30 6 C 29.6604 -29.8644 0 0
>
> M  V30 7 N 29.6604 -31.9103 0 0 CHG=1
>
> M  V30 8 O 29.0691 -32.934 0 0
>
> M  V30 9 O 30.8416 -31.9103 0 0 CHG=-1
>
> M  V30 END ATOM
>
> M  V30 BEGIN BOND
>
> M  V30 1 1 1 2
>
> M  V30 2 2 3 1
>
> M  V30 3 2 2 4
>
> M  V30 4 1 4 5
>
> M  V30 5 1 6 3
>
> M  V30 6 2 6 5
>
> M  V30 7 1 5 7
>
> M  V30 8 2 7 8
>
> M  V30 9 1 7 9
>
> M  V30 END BOND
>
> M  V30 BEGIN SGROUP
>
> M  V30 1 SUP 0  ATOMS=(3 7 8 9) XBONDS=(1 7) LABEL="NO2" CSTATE=(4 7
>  -
>
> M  V30  -0.4100   0.7100   0.0000)
>
> M  V30 END SGROUP
>
> M  V30 END CTAB
>
> M  END
>
> """
>
> ctab_fixed = """
>
>   ACCLDraw12082117492D
>
>
>
>   0  0  0     0  0            999 V3000
>
> M  V30 BEGIN CTAB
>
> M  V30 COUNTS 9 9 1 0 0
>
> M  V30 BEGIN ATOM
>
> M  V30 1 C 27.8908 -28.8436 0 0
>
> M  V30 2 C 27.2981 -29.8716 0 0
>
> M  V30 3 C 29.072 -28.8436 0 0
>
> M  V30 4 C 27.8937 -30.8881 0 0
>
> M  V30 5 C 29.0691 -30.8881 0 0
>
> M  V30 6 C 29.6604 -29.8644 0 0
>
> M  V30 7 N 29.6604 -31.9103 0 0 CHG=1
>
> M  V30 8 O 29.0691 -32.934 0 0
>
> M  V30 9 O 30.8416 -31.9103 0 0 CHG=-1
>
> M  V30 END ATOM
>
> M  V30 BEGIN BOND
>
> M  V30 1 1 1 2
>
> M  V30 2 2 3 1
>
> M  V30 3 2 2 4
>
> M  V30 4 1 4 5
>
> M  V30 5 1 6 3
>
> M  V30 6 2 6 5
>
> M  V30 7 1 5 7
>
> M  V30 8 2 7 8
>
> M  V30 9 1 7 9
>
> M  V30 END BOND
>
> M  V30 BEGIN SGROUP
>
> M  V30 1 SUP 0  ATOMS=(3 7 8 9) XBONDS=(1 7) CSTATE=(4 7        -
>
> M  V30  -0.4100   0.7100   0.0000) LABEL="NO2"
>
> M  V30 END SGROUP
>
> M  V30 END CTAB
>
> M  END
>
> """
>
>
>
> select is_valid_ctab('{ctab_og}')
>
> Returns False
>
> select is_valid_ctab('{ctab_fixed}')
>
> Returns True
>
>
> On Thu, Dec 9, 2021 at 10:06 PM Paolo Tosco <paolo.tosco.m...@gmail.com>
> wrote:
>
>> Hi Susan,
>>
>> that looks like a bug in the way the MDL query is parsed; I have filed it
>> here:
>> https://github.com/rdkit/rdkit/issues/4785
>>
>> If you can afford doing some Python massaging to your CTAB queries and
>> converting them to SMARTS before submitting them to PostgreSQL when they
>> fail sanitization, the following should work:
>>
>> mol_og = Chem.MolFromMolBlock(ctab_og, sanitize=False)
>> try:
>>     Chem.SanitizeMol(mol_og)
>>     cur = conn.cursor()
>>     cur.execute(f"""select mol_from_smiles('{sm1}') @>
>> qmol_from_ctab('{ctab_og}')""")
>>     rows = cur.fetchall()
>>     print(rows)
>> except Chem.AtomKekulizeException as e:
>>     if re.match(r"non-ring atom \d+ marked aromatic", str(e)):
>>         Chem.FastFindRings(mol_og)
>>         rwmol_og = Chem.RWMol(mol_og)
>>         for a in mol_og.GetAtoms():
>>             if a.GetIsAromatic() and not a.IsInRing():
>>                 rwmol_og.ReplaceAtom(a.GetIdx(),
>> rdqueries.IsAromaticQueryAtom())
>>         try:
>>             Chem.SanitizeMol(rwmol_og)
>>             smarts_og = Chem.MolToSmarts(rwmol_og)
>>             cur = conn.cursor()
>>             cur.execute(f"""select mol_from_smiles('{sm1}') @>
>> mol_from_smarts('{smarts_og}')""")
>>             rows = cur.fetchall()
>>             print(rows)
>>         except:
>>             ...
>>
>> HTH, cheers
>> p.
>>
>>
>> On Thu, Dec 9, 2021 at 5:32 PM Susan Leung <susanhle...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>>
>>>
>>> I am trying to do some substructure queries using the RDKit PostgreSQL
>>> cartridge. Specifically, my queries substructure inputs are CTAB (not
>>> SMARTS) so I would like to use qmol_from_ctab. However, I have some
>>> problems with making valid query molecules with a few CTABs.
>>>
>>>
>>>
>>> In this query, I try to use a CTAB to make a query to search for aryl
>>> boronate acid/ester. I can make an equivalent query using SMARTS but the
>>> CTAB is not valid.
>>>
>>>
>>> As far as I am aware, there's no warning message when using the SQL
>>> functions, so I use MolFromMolBlock from python and get "non-ring atom
>>> 0 marked aromatic" so I correct the aromatic bond type to double bond
>>> and the CTAB can be read in (but that's not the query I want). I am
>>> guessing that there may be additional validity checks / sanitization
>>> steps when executing qmol_from_ctab vs qmol_from_smarts? As far as I can
>>> see, there’s no flag in qmol_from_ctab.
>>>
>>>
>>>
>>> I describe the general problem below but also attach the ipynb (if it is
>>> useful) that uses psycopg2 to do the SQL , leaving out the database
>>> connection credentials.
>>>
>>>
>>>
>>> Many thanks,
>>>
>>>
>>>
>>> Susan
>>>
>>> __________________________________
>>>
>>>
>>>
>>> For example, I want to match an aromatic boronic acid:
>>>
>>> sm1 = 'OB(O)c1ccccc1'
>>>
>>>
>>>
>>> But the following CTAB isn’t valid. MolFromFromBlock returns non-ring
>>> atom marked aromatic error so I suspect it’s to do with that. Also changing
>>> the bond marked aromatic ‘4’ to a double bond ‘2’ makes the ctab valid.
>>>
>>> ctab_og = """Boronate acid/ester(aryl)
>>>
>>>   SciTegic12012112112D
>>>
>>>
>>>
>>>   5  4  0  0  0  0            999 V2000
>>>
>>>     1.7243   -2.7324    0.0000 A   0  0
>>>
>>>     2.7559   -2.1456    0.0000 C   0  0
>>>
>>>     3.7808   -2.7324    0.0000 B   0  0
>>>
>>>     4.8057   -2.1456    0.0000 O   0  0
>>>
>>>     3.7808   -3.9190    0.0000 O   0  0
>>>
>>>   1  2  4  0  0  1  0
>>>
>>>   2  3  1  0
>>>
>>>   3  4  1  0
>>>
>>>   3  5  1  0
>>>
>>> M  END
>>>
>>> > <Name>
>>>
>>> Boronate acid/ester(aryl)
>>>
>>>
>>>
>>> """
>>>
>>> ctab_fixed = """Boronate acid/ester(aryl)
>>>
>>>   SciTegic12012112112D
>>>
>>>
>>>
>>>   5  4  0  0  0  0            999 V2000
>>>
>>>     1.7243   -2.7324    0.0000 A   0  0
>>>
>>>     2.7559   -2.1456    0.0000 C   0  0
>>>
>>>     3.7808   -2.7324    0.0000 B   0  0
>>>
>>>     4.8057   -2.1456    0.0000 O   0  0
>>>
>>>     3.7808   -3.9190    0.0000 O   0  0
>>>
>>>   1  2  2  0  0  1  0
>>>
>>>   2  3  1  0
>>>
>>>   3  4  1  0
>>>
>>>   3  5  1  0
>>>
>>> M  END
>>>
>>> > <Name>
>>>
>>> Boronate acid/ester(aryl)
>>>
>>>
>>>
>>> """
>>>
>>> select is_valid_ctab('{ctab_og}')
>>>
>>>
>>>
>>> Returns False
>>>
>>> select is_valid_ctab('{ctab_fixed}')
>>>
>>> Returns True
>>>
>>>
>>>
>>> However, I can make a qmol using SMARTS match sm1. Is there of making
>>> the query CTAB valid so we don’t have to use SMARTS?
>>>
>>> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}')
>>>
>>>
>>>
>>> Returns False
>>>
>>>
>>>
>>> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}')
>>>
>>>
>>>
>>> Returns True
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to