You might find this link useful -
http://www.rdkit.org/docs/GettingStartedInPython.html#chemical-transformations
However, the issue in your case is the SMARTS definitions. If one SMARTS
completely covers another one, it is difficult to tell whether an
overlapping match is an artifact or not. I think it would be reasonable to
revise the SMARTS to avoid such overlaps, or to create a list of rules
(maybe hierarchical) that defines which overlaps are valid and which are not.
Pavel.
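To make the "list of rules" idea a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the group labels, the match tuples (a subset of those described in the quoted message below, with ring-carbon indices read off the SMILES), and the ALLOWED_OVERLAPS table are hypothetical and are not part of RDKit or of the Joback definitions.

```python
# Hypothetical match table for 4-methylsalicylic acid, CC1=CC(=C(C=C1)C(=O)O)O.
# Labels are illustrative; atom indices follow the SMILES atom order.
# Groups are listed from most to least specific so larger groups claim atoms first.
MATCHES = [
    ("-COOH", [(7, 8, 9)]),
    ("-OH (phenol)", [(10, 3)]),
    (">C=O", [(7, 8)]),
    ("-OH", [(9,)]),
    ("=C< (ring)", [(1,), (3,), (4,)]),
]

# Assumed rule: a ring carbon may share its atom with the phenol -OH match,
# because the phenol pattern only uses that carbon as context.
ALLOWED_OVERLAPS = {frozenset(["=C< (ring)", "-OH (phenol)"])}


def count_groups(matches, allowed):
    """Count matches, rejecting a match only if it overlaps a group it may not share atoms with."""
    owner = {}   # atom index -> set of group names that already claimed it
    counts = {}
    for name, hits in matches:
        counts[name] = 0
        for hit in hits:
            clash = any(
                frozenset([name, other]) not in allowed
                for idx in hit
                for other in owner.get(idx, ())
            )
            if clash:
                continue
            counts[name] += 1
            for idx in hit:
                owner.setdefault(idx, set()).add(name)
    return counts


counts = count_groups(MATCHES, ALLOWED_OVERLAPS)
print(counts)
```

With this rule table the ring carbon at atom 3 is still counted even though the phenol match also contains it, while the matches nested inside -COOH are rejected.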
On 03/08/2017 06:32 PM, Chenyang Shi wrote:
Dear Hongbin,
I tried your method on a molecule, 4-Methylsalicylic acid
(CC1=CC(=C(C=C1)C(=O)O)O). I looped through all groups defined in
Joback method (using SMARTS), and used m.GetSubstructMatches to print
out all atom positions. The result is summarized in the table.
We can see there are duplicated counts, all coming from the COOH group. As
suggested by Hongbin, we can remove duplicated atoms by looking at
their positions: in this case, ((9,),), ((7,8),), ((7,),), and ((8,),)
are subsets of ((7,8,9),) from -COOH, so we can indeed get rid of these
duplicates. However, I also noticed that atom (3,) from the =C< (ring)
group is also part of -OH (phenol), ((10,3),). If we apply the same
algorithm to remove duplicates, the =C< (ring) group will be
counted only twice instead of three times.
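The undercounting described above can be reproduced with a few lines of plain Python. This is only a sketch: the match tuples are reconstructed from the text (the original table is an image), the ring-carbon indices 1, 3 and 4 are read off the SMILES, and drop_subsets is a hypothetical helper name.

```python
# Matches reconstructed from the message; labels are illustrative.
matches = {
    "-COOH": [(7, 8, 9)],
    "-OH (phenol)": [(10, 3)],
    "=C< (ring)": [(1,), (3,), (4,)],
    "-OH": [(9,)],
}


def drop_subsets(matches):
    """Keep a match only if its atoms are not a proper subset of a match from another group."""
    all_hits = [(name, set(hit)) for name, hits in matches.items() for hit in hits]
    kept = {name: 0 for name in matches}
    for name, atoms in all_hits:
        covered = any(
            atoms < other for other_name, other in all_hits if other_name != name
        )
        if not covered:
            kept[name] += 1
    return kept


result = drop_subsets(matches)
print(result)
```

The -OH match at atom 9 is correctly dropped, but so is the ring carbon at atom 3, because (3,) is a subset of (10, 3). That leaves two =C< (ring) counts where three are expected.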
Greg, you mentioned as an alternative I can delete substructure using
chemical reaction method. It would be greatly appreciated if you could
show me (point me to) a simple example code, perhaps on a simple
molecule? I find myself at a loss when browsing the manual. I would
like to try also in that direction.
Thanks,
Chenyang
[Inline image 1: table of Joback groups and matched atom positions]
On Mon, Mar 6, 2017 at 1:52 AM, Greg Landrum wrote:
The solution that Hongbin proposes to the double-counting problem
is a good one. Just be sure to sort your substructure queries in
the right order so that the more complex ones come first.
Another thing you might think about is making your queries more
specific. For example, as you pointed out "[OH]" is very general
and matches parts of carboxylic acids and a number of other
functional groups. The RDKit has a set of fairly well tested
(though certainly not perfect) functional group definitions in
$RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
definition from there looks like this:
[O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]
-greg
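As a quick check of how much more specific that definition is, the sketch below compares it with the generic "[OH]" query on ethanol and acetic acid. The SMARTS string is the one quoted above; the helper n_matches and the choice of test molecules are just for illustration.

```python
from rdkit import Chem

# Alcohol definition quoted above from Functional_Group_Hierarchy.txt
alcohol = Chem.MolFromSmarts('[O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]')
# The generic query discussed in the thread
generic_oh = Chem.MolFromSmarts('[OH]')


def n_matches(smiles, patt):
    """Number of unique substructure matches of patt in the molecule given as SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    return len(mol.GetSubstructMatches(patt))


for name, smi in [('ethanol', 'CCO'), ('acetic acid', 'CC(=O)O')]:
    print(name, n_matches(smi, generic_oh), n_matches(smi, alcohol))
```

Both queries find the hydroxyl in ethanol, but only the generic "[OH]" also hits the acid oxygen of acetic acid; the more specific definition excludes it because the attached carbon carries a C=O.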
On Mon, Mar 6, 2017 at 7:20 AM, Hongbin Yang (杨弘宾) wrote:
Hi, Chenyang,
You don't need to delete the substructure from the molecule.
Just check whether the mapped atoms have already been matched. For
example:
from rdkit import Chem
m = Chem.MolFromSmiles('CC(=O)O')
OH = Chem.MolFromSmarts('[OH]')
COOH = Chem.MolFromSmarts('C(O)=O')
m.GetSubstructMatches(OH)
>>((3,),)
m.GetSubstructMatches(COOH)
>>((1, 3, 2),)
Since atom 3 has already been matched, it should be ignored.
So you can keep a "set" of the matched atoms to avoid
counting them repeatedly.
------------------------------------------------------------------------
Hongbin Yang 杨弘宾
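A minimal sketch of the "set of matched atoms" idea, combined with the advice elsewhere in this thread to try the more complex queries first. The PATTERNS list and the count_groups helper are illustrative names, not an RDKit API, and the three SMARTS are just the ones that appear in this discussion.

```python
from rdkit import Chem

# Ordered from most to least specific, so that -COOH claims its atoms
# before the generic [OH] query is tried.
PATTERNS = [
    ('COOH', 'C(=O)[OH]'),
    ('CH3', '[CH3;X4]'),
    ('OH', '[OH]'),
]


def count_groups(smiles, patterns=PATTERNS):
    """Count each group, skipping any match that reuses an already claimed atom."""
    mol = Chem.MolFromSmiles(smiles)
    used = set()      # atom indices claimed by earlier matches
    counts = {}
    for name, smarts in patterns:
        patt = Chem.MolFromSmarts(smarts)
        counts[name] = 0
        for match in mol.GetSubstructMatches(patt):
            if used.isdisjoint(match):
                counts[name] += 1
                used.update(match)
    return counts


counts = count_groups('CC(=O)O')
print(counts)
```

For acetic acid this counts one COOH and one CH3, and the [OH] match is skipped because its oxygen already belongs to the COOH match. Note that this strict "no shared atoms" rule would also reject legitimate overlaps, so it only works when the patterns do not share context atoms.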
*From:* Chenyang Shi
*Date:* 2017-03-06 14:04
*To:* Greg Landrum
*CC:* RDKit Discuss
*Subject:* Re: [Rdkit-discuss] delete a substructure
Hi Greg,
Thanks for a prompt reply. I did try
"GetSubstructMatches()" and it returns correct numbers of
substructures for CH3COOH. The potential problem with this
approach is that if the molecule is getting complicated,
it will possibly generate duplicate numbers for certain
functional groups. For example, --OH (alcohol) group will
be likely also counted in --COOH. A safer way, in my mind,
is to remove the substructure that has been counted.
Greg, you mentioned "chemical reaction functionality"; could
you show me a demo script that uses it, with CH3COOH as an
example? I will definitely delve into the manual to learn
more, but reading your code would be a good start.
Thanks,
Chenyang
On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum wrote:
Hi Chenyang,
If you're really interested in counting the number of
times the substructure appears, you can do that much
quicker with `GetSubstructMatches()`:
In [1]: from rdkit import Chem
In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
Out[3]: 2
Is that sufficient, or do you actually want to
sequentially remove all of the groups in your list?
If you actually want to remove them, you are probably
better off using the chemical reaction functionality
instead of DeleteSubstructs(), which recalculates the
number of implicit Hs on atoms after each call.
-greg
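A small sketch of the difference, using acetic acid. The reaction SMARTS here is only one possible way to do it and is my own illustration: it replaces the -COOH group with a dummy atom (*) so that the neighbouring carbon keeps its original hydrogen count, whereas DeleteSubstructs() leaves a re-saturated CH4.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

m = Chem.MolFromSmiles('CC(=O)O')
methyl = Chem.MolFromSmarts('[CH3;X4]')

# DeleteSubstructs: the remaining carbon is re-saturated, so it is CH4, not CH3.
rm = AllChem.DeleteSubstructs(m, Chem.MolFromSmarts('C(=O)[OH]'))
Chem.SanitizeMol(rm)
n_methyl_after_delete = len(rm.GetSubstructMatches(methyl))

# Reaction: swap -COOH for a dummy atom; the mapped neighbour atom is kept as is.
rxn = AllChem.ReactionFromSmarts('[*:1]C(=O)[OH]>>[*:1]*')
product = rxn.RunReactants((m,))[0][0]
Chem.SanitizeMol(product)   # reaction products are not sanitized automatically
n_methyl_after_rxn = len(product.GetSubstructMatches(methyl))

print(n_methyl_after_delete, n_methyl_after_rxn)
```

After the reaction the methyl query still matches, because the dummy atom occupies the position where the -COOH group was attached.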
On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi wrote:
I am new to rdkit but I am already impressed by
its vibrant community. I have a question regarding
deleting substructure. In the RDKIT documentation,
this is a snippet of code describing how to delete
substructure:
>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> m = Chem.MolFromSmiles("CC(=O)O")
>>> patt = Chem.MolFromSmarts("C(=O)[OH]")
>>> rm = AllChem.DeleteSubstructs(m, patt)
>>> Chem.MolToSmiles(rm)
'C'
This block of code first loads a molecule CH3COOH
using SMILES code, then defines a substructure
COOH using SMARTS code which is to be deleted.
After final line of code, the program outputs 'C',
in SMILES form.
I want to develop a method for counting the
functional groups in a molecule. In the CH3COOH case, I
can count the --CH3 and --COOH groups by
using their respective SMARTS with no
problem. However, when the molecule becomes more
complicated, it is preferable to delete each
substructure that has been found before moving
on to the next SMARTS search. In the current
case, after finding the -COOH group and deleting it,
the leftover is 'C', which is essentially CH4
instead of --CH3, so I cannot proceed with the
SMARTS search for --CH3 ([CH3;A;X4!R]).
Is there any way to work around this?
Thanks,
Chenyang
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the
world's most
engaging tech sites, SlashDot.org!
http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
[email protected]
<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
<https://lists.sourceforge.net/lists/listinfo/rdkit-discuss>