Re: [Rdkit-discuss] delete a substructure

2017-03-11 Thread David Cosgrove
There's a bit more to it than that. If you're developing a SMARTS for a
particular type of group, you can run it against a large file and see the
false hits and false non-hits quickly and revise your SMARTS accordingly.
And as you say, it is also available.
Dave


On Fri, 10 Mar 2017 at 20:53, Peter S. Shenkin  wrote:

> Sounds like Daylight's "depictmatch", unfortunately no longer available on
> line
>
> -P.
>
> On Fri, Mar 10, 2017 at 1:28 PM, David Cosgrove <
> davidacosgrov...@gmail.com> wrote:
>
> Hi,
> In the RDKit source, under the 2d drawing code in the c++ part there's the
> full source code for a QT program that will run one or more SMARTS patterns
> against a set of molecules, split any matches and non-matches into 2
> displays side by side and colour the atoms that the SMARTS match. It needs
> a bit of persistence to compile and has only been tried on Linux but is
> very helpful for writing new SMARTS. If there's interest, when I have a bit
> of spare time over the next few weeks I can make sure it's easier to
> compile. If you poke about in my website (cozchemix.co.uk) you'll find a
> link to my GitHub repo with an earlier version which has been compiled
> under Linux recently and has instructions. Sorry not to put links in, I
> don't have access to a computer at the moment, just phone.
>
> Cheers,
> Dave
>
> On Thu, 9 Mar 2017 at 18:41, Chenyang Shi  wrote:
>
> Thank you Chris. I found that one too; it is quite convenient to visualize
> both SMARTS and SMILES strings.
>
> On Thu, Mar 9, 2017 at 11:28 AM, Chris Swain  wrote:
>
> I use SMARTSviewer at Univ of Hamburg
>
> http://www.zbh.uni-hamburg.de/en/bioinformatics-server.html
>
> Chris
>
> On 9 Mar 2017, at 17:21, rdkit-discuss-requ...@lists.sourceforge.net
> wrote:
>
> One last question I have is do you guys have convenient online or local
> documents to look up desired SMARTS.
> Greg mentioned $RDBASE/Data/Functional_Group_Hierarchy.txt, which comes
> with the installation of RDKIT.
> Brian suggested daylight website,
> http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html,
> which is a good place as well.
>
> Best,
> Chenyang
>
>
>
>
> --
> Announcing the Oxford Dictionaries API! The API offers world-renowned
> dictionary content that is easy and intuitive to access. Sign up for an
> account today to start using our lexical data to power your apps and
> projects. Get started today and enter our developer competition.
> http://sdm.link/oxford
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
>
>
>
> --
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk


Re: [Rdkit-discuss] delete a substructure

2017-03-10 Thread Peter S. Shenkin
Sounds like Daylight's "depictmatch", unfortunately no longer available on
line

-P.

On Fri, Mar 10, 2017 at 1:28 PM, David Cosgrove 
wrote:

> Hi,
> In the RDKit source, under the 2d drawing code in the c++ part there's the
> full source code for a QT program that will run one or more SMARTS patterns
> against a set of molecules, split any matches and non-matches into 2
> displays side by side and colour the atoms that the SMARTS match. It needs
> a bit of persistence to compile and has only been tried on Linux but is
> very helpful for writing new SMARTS. If there's interest, when I have a bit
> of spare time over the next few weeks I can make sure it's easier to
> compile. If you poke about in my website (cozchemix.co.uk) you'll find a
> link to my GitHub repo with an earlier version which has been compiled
> under Linux recently and has instructions. Sorry not to put links in, I
> don't have access to a computer at the moment, just phone.
>
> Cheers,
> Dave
>
> On Thu, 9 Mar 2017 at 18:41, Chenyang Shi  wrote:
>
>> Thank you Chris. I found that one too; it is quite convenient to
>> visualize both SMARTS and SMILES strings.
>>
>> On Thu, Mar 9, 2017 at 11:28 AM, Chris Swain  wrote:
>>
>> I use SMARTSviewer at Univ of Hamburg
>>
>> http://www.zbh.uni-hamburg.de/en/bioinformatics-server.html
>>
>> Chris
>>
>> On 9 Mar 2017, at 17:21, rdkit-discuss-requ...@lists.sourceforge.net
>> wrote:
>>
>> One last question I have is do you guys have convenient online or local
>> documents to look up desired SMARTS.
>> Greg mentioned $RDBASE/Data/Functional_Group_Hierarchy.txt, which comes
>> with the installation of RDKIT.
>> Brian suggested daylight website,
>> http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html,
>> which is a good place as well.
>>
>> Best,
>> Chenyang
>>
>>
>>
>> 
>>
>>
>> 
>>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
> 
>
>


Re: [Rdkit-discuss] delete a substructure

2017-03-09 Thread Chenyang Shi
Thank you Chris. I found that one too; it is quite convenient to visualize
both SMARTS and SMILES strings.

On Thu, Mar 9, 2017 at 11:28 AM, Chris Swain  wrote:

> I use SMARTSviewer at Univ of Hamburg
>
> http://www.zbh.uni-hamburg.de/en/bioinformatics-server.html
>
> Chris
>
> On 9 Mar 2017, at 17:21, rdkit-discuss-requ...@lists.sourceforge.net
> wrote:
>
> One last question I have is do you guys have convenient online or local
> documents to look up desired SMARTS.
> Greg mentioned $RDBASE/Data/Functional_Group_Hierarchy.txt, which comes
> with the installation of RDKIT.
> Brian suggested daylight website,
> http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html,
> which is a good place as well.
>
> Best,
> Chenyang
>
>
>
> 
>
>


Re: [Rdkit-discuss] delete a substructure

2017-03-09 Thread Chris Swain
I use SMARTSviewer at Univ of Hamburg

http://www.zbh.uni-hamburg.de/en/bioinformatics-server.html 


Chris
> On 9 Mar 2017, at 17:21, rdkit-discuss-requ...@lists.sourceforge.net wrote:
> 
> One last question I have is do you guys have convenient online or local
> documents to look up desired SMARTS.
> Greg mentioned $RDBASE/Data/Functional_Group_Hierarchy.txt, which comes
> with the installation of RDKIT.
> Brian suggested daylight website,
> http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html,
> which is a good place as well.
> 
> Best,
> Chenyang



Re: [Rdkit-discuss] delete a substructure

2017-03-09 Thread Chenyang Shi
Thanks Hongbin and Pavel for the suggestions. I am now confident that the
approach Hongbin proposed to remove duplicate counts is a robust one. Now I
need to revisit/recheck all my SMARTS definitions.

One last question I have is do you guys have convenient online or local
documents to look up desired SMARTS.
Greg mentioned $RDBASE/Data/Functional_Group_Hierarchy.txt, which comes
with the installation of RDKIT.
Brian suggested daylight website,
http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html,
which is a good place as well.

Best,
Chenyang

On Thu, Mar 9, 2017 at 1:09 AM, 杨弘宾 <yanyangh...@163.com> wrote:

> Hi Chenyang,
>
> Your issue was caused by the definition of "-OH(phenol)", I think.  If
> you define this pattern as "cO", the atom *3* will be matched since it is
> the aromatic carbon bonded to an oxygen.  I guess you just wanted to match
> exactly the oxygen and restrict it with "bonding with an aromatic carbon".
> So the SMARTS should be "[$(Oc)]", which indicates an oxygen with the
> environment of "bonding with an aromatic carbon".
>
> m = Chem.MolFromSmiles('CC1=CC(=C(C=C1)C(=O)O)O')
> m.GetSubstructMatches(Chem.MolFromSmarts('[$(Oc)]'))
> >>> ((10,),)
>
> Then only atom *10* will be matched and it won't interfere with other
> counts.
>
> Reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
> 4.4
>
> --
> Hongbin Yang
>
>
> *From:* Chenyang Shi <cs3...@columbia.edu>
> *Date:* 2017-03-09 01:32
> *To:* Greg Landrum <greg.land...@gmail.com>
> *CC:* rdkit-discuss <rdkit-discuss@lists.sourceforge.net>; 杨弘宾
> <yanyangh...@163.com>
> *Subject:* Re: [Rdkit-discuss] delete a substructure
> Dear Hongbin,
>
> I tried your method on a molecule, 4-Methylsalicylic acid
> (CC1=CC(=C(C=C1)C(=O)O)O). I looped through all groups defined in Joback
> method (using SMARTS), and used m.GetSubstructMatches to print out all
> atom positions. The result is summarized in the table.
>
> We can see there are duplicated counts--coming from COOH group. As
> suggested by Hongbin, we can remove duplicated atoms by looking at their
> positions--in this case, ((9),), ((7,8,),), ((7,),), and ((8,),) are
> subsets of ((7,8,9)) from -COOH. Indeed we can get rid of these duplicates.
> However, I also noticed that Atom (3,) from =C< (ring) group is also a part
> of -OH (phenol) ((10,3),). If we apply the same algorithm to remove
> duplicates, the =C<(ring) group will be only counted twice instead of three
> times.
>
> Greg, you mentioned as an alternative I can delete substructure using
> chemical reaction method. It would be greatly appreciated if you could show
> me (point me to) a simple example code, perhaps on a simple molecule? I
> find myself at a loss when browsing the manual. I would like to try also in
> that direction.
>
> Thanks,
> Chenyang
>
>
> [image: Inline image 1]
>
>
> On Mon, Mar 6, 2017 at 1:52 AM, Greg Landrum <greg.land...@gmail.com>
> wrote:
>
>> The solution that Hongbin proposes to the double-counting problem is a
>> good one. Just be sure to sort your substructure queries in the right order
>> so that the more complex ones come first.
>>
>> Another thing you might think about is making your queries more specific.
>> For example, as you pointed out "[OH]" is very general and matches parts of
>> carboxylic acids and a number of other functional groups. The RDKit has a
>> set of fairly well tested (though certainly not perfect) functional group
>> definitions in $RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
>> definition from there looks like this:
>> [O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]
>>
>>
>> -greg
>>
>>
>> On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 <yanyangh...@163.com> wrote:
>>
>>> Hi, Chenyang,
>>> You don't need to delete the substructure from the molecule. Just
>>> check whether the mapped atoms have been matched. For example:
>>>
>>> m = Chem.MolFromSmiles('CC(=O)O')
>>> OH = Chem.MolFromSmarts('[OH]')
>>> COOH = Chem.MolFromSmarts('C(O)=O')
>>>
>>> m.GetSubstructMatches(OH)
>>> >> ((3,),)
>>> m.GetSubstructMatches(COOH)
>>> >> ((1, 3, 2),)
>>>
>>> Since atom "3" has been already matched, it should be ignored.
>>> So you can create a "set" to record the matched atoms to avoid
>>> repetitive count.
>>>
>>> --
>>> Hongbin Yang 杨弘宾
>>>
>>>
>>

Re: [Rdkit-discuss] delete a substructure

2017-03-08 Thread 杨弘宾




Hi Chenyang,

Your issue was caused by the definition of "-OH(phenol)", I think. If you
define this pattern as "cO", atom 3 will be matched since it is the aromatic
carbon bonded to an oxygen. I guess you just wanted to match exactly the
oxygen and restrict it with "bonding with an aromatic carbon". So the SMARTS
should be "[$(Oc)]", which indicates an oxygen with the environment of
"bonding with an aromatic carbon".

m = Chem.MolFromSmiles('CC1=CC(=C(C=C1)C(=O)O)O')
m.GetSubstructMatches(Chem.MolFromSmarts('[$(Oc)]'))
>>> ((10,),)

Then only atom 10 will be matched and it won't interfere with other counts.

Reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
(section 4.4)
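
To make the effect of the two phenol definitions concrete, here is a small sketch in plain Python (no RDKit calls). It assumes the "claim atoms as you go" bookkeeping discussed elsewhere in this thread. The phenol match (10, 3) is the one reported in Chenyang's message, (10,) is the result quoted above for "[$(Oc)]", and the ring-carbon matches (1,), (3,), (4,) are an assumption worked out by hand from the atom order of the SMILES. The function name count_with_claims is hypothetical.

```python
def count_with_claims(ordered_groups):
    """Count matches per group, skipping any match that reuses a claimed atom."""
    claimed = set()
    counts = {}
    for name, matches in ordered_groups:
        kept = 0
        for match in matches:
            if claimed.isdisjoint(match):
                claimed.update(match)
                kept += 1
        counts[name] = kept
    return counts


# Assumed ring-carbon matches for CC1=CC(=C(C=C1)C(=O)O)O (derived by hand).
ring_c = ("=C< (ring)", ((1,), (3,), (4,)))

# Phenol match that also contains the aromatic carbon (atom 3).
broad = count_with_claims([("-OH (phenol)", ((10, 3),)), ring_c])

# Phenol match restricted to the oxygen alone, as with "[$(Oc)]".
narrow = count_with_claims([("-OH (phenol)", ((10,),)), ring_c])

print(broad)
print(narrow)
```

With the broader match, atom 3 is already claimed by the phenol group, so the ring carbon is counted twice instead of three times, which is the behaviour Chenyang describes. With the oxygen-only match, the ring carbons are left untouched.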


Hongbin Yang 

From: Chenyang Shi
Date: 2017-03-09 01:32
To: Greg Landrum
CC: rdkit-discuss; 杨弘宾
Subject: Re: [Rdkit-discuss] delete a substructure






Dear Hongbin,

I tried your method on a molecule, 4-Methylsalicylic acid
(CC1=CC(=C(C=C1)C(=O)O)O). I looped through all groups defined in Joback method
(using SMARTS), and used m.GetSubstructMatches to print out all atom positions.
The result is summarized in the table.

We can see there are duplicated counts--coming from COOH group. As suggested by
Hongbin, we can remove duplicated atoms by looking at their positions--in this
case, ((9),), ((7,8,),), ((7,),), and ((8,),) are subsets of ((7,8,9)) from
-COOH. Indeed we can get rid of these duplicates. However, I also noticed that
Atom (3,) from =C< (ring) group is also a part of -OH (phenol) ((10,3),). If we
apply the same algorithm to remove duplicates, the =C<(ring) group will be only
counted twice instead of three times.

Greg, you mentioned as an alternative I can delete substructure using chemical
reaction method. It would be greatly appreciated if you could show me (point me
to) a simple example code, perhaps on a simple molecule? I find myself at a
loss when browsing the manual. I would like to try also in that direction.

Thanks,
Chenyang





On Mon, Mar 6, 2017 at 1:52 AM, Greg Landrum <greg.land...@gmail.com> wrote:

The solution that Hongbin proposes to the double-counting problem is a good
one. Just be sure to sort your substructure queries in the right order so that
the more complex ones come first.

Another thing you might think about is making your queries more specific. For
example, as you pointed out "[OH]" is very general and matches parts of
carboxylic acids and a number of other functional groups. The RDKit has a set
of fairly well tested (though certainly not perfect) functional group
definitions in $RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
definition from there looks like this:
[O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]


-greg

On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 <yanyangh...@163.com> wrote:

Hi, Chenyang,

You don't need to delete the substructure from the molecule. Just check
whether the mapped atoms have been matched. For example:

m = Chem.MolFromSmiles('CC(=O)O')
OH = Chem.MolFromSmarts('[OH]')
COOH = Chem.MolFromSmarts('C(O)=O')

m.GetSubstructMatches(OH)
>> ((3,),)
m.GetSubstructMatches(COOH)
>> ((1, 3, 2),)

Since atom "3" has been already matched, it should be ignored. So you can
create a "set" to record the matched atoms to avoid repetitive count.


Hongbin Yang 杨弘宾


From: Chenyang Shi
Date: 2017-03-06 14:04
To: Greg Landrum
CC: RDKit Discuss
Subject: Re: [Rdkit-discuss] delete a substructure

Hi Greg,

Thanks for a prompt reply. I did try "GetSubstructMatches()" and it returns
correct numbers of substructures for CH3COOH. The potential problem with this
approach is that if the molecule is getting complicated, it will possibly
generate duplicate numbers for certain functional groups. For example, --OH
(alcohol) group will be likely also counted in --COOH. A safer way, in my mind,
is to remove the substructure that has been counted.

Greg, you mentioned "chemical reaction functionality", can you show me a demo
script with that using CH3COOH as an example. I will definitely delve into the
manual to learn more. But reading your code will be a good start.

Thanks,
Chenyang
 
On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum <greg.land...@gmail.com> wrote:

Hi Chenyang,

If you're really interested in counting the number of times the substructure
appears, you can do that much quicker with `GetSubstructMatches()`:

In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
Out[3]: 2

Is that sufficient, or do you actually want to sequentially remove all of the
groups in your list?

If you actually want to remove them, you are probably better off using the
chemical reaction functionality instead of DeleteSubstructs(), which
recalculates the number of implicit Hs on atoms after each call.

-greg

On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi <c

Re: [Rdkit-discuss] delete a substructure

2017-03-08 Thread Pavel Polishchuk
You might find this link useful - 
http://www.rdkit.org/docs/GettingStartedInPython.html#chemical-transformations


However, the issue in your case is the SMARTS definitions. If one SMARTS
completely covers another one, it is difficult to tell whether the overlap
is an artifact or not. I think it might be reasonable to revise the SMARTS
to avoid such overlaps, or to create a list of rules (maybe hierarchical)
which define valid and invalid overlaps.
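
One way such a rule list might look, as a rough sketch in plain Python (no RDKit calls): each group names the groups that are allowed to absorb it, and a match is dropped only when its atoms are a subset of a match of one of those groups. The rule table, the group labels and the function name filter_matches are all hypothetical. The -COOH and phenol atom tuples are the ones quoted in this thread for 4-methylsalicylic acid, while the "-OH (alcohol)" label for atom 9 and the ring-carbon matches are assumptions made for illustration.

```python
# Hypothetical rule table: group -> groups allowed to absorb its matches.
absorbed_by = {
    "-OH (alcohol)": {"-COOH"},
    "=C< (ring)": set(),  # nothing may absorb a ring carbon
}

# Match tuples; the alcohol label and ring-carbon indices are assumptions.
matches = {
    "-COOH": ((7, 8, 9),),
    "-OH (alcohol)": ((9,),),
    "-OH (phenol)": ((10, 3),),
    "=C< (ring)": ((1,), (3,), (4,)),
}


def filter_matches(matches, absorbed_by):
    """Drop a match only if a declared parent group fully covers its atoms."""
    kept = {}
    for name, group_matches in matches.items():
        parents = absorbed_by.get(name, set())
        parent_atom_sets = [
            set(m) for parent in parents for m in matches.get(parent, ())
        ]
        kept[name] = tuple(
            m for m in group_matches
            if not any(set(m) <= atoms for atoms in parent_atom_sets)
        )
    return kept


kept = filter_matches(matches, absorbed_by)
counts = {name: len(ms) for name, ms in kept.items()}
print(counts)
```

Under these assumed rules the hydroxyl inside -COOH is discarded, but the ring carbon at atom 3 survives even though it also appears in the phenol match, because that overlap has been declared valid.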


Pavel.


On 03/08/2017 06:32 PM, Chenyang Shi wrote:

Dear Hongbin,

I tried your method on a molecule, 4-Methylsalicylic acid 
(CC1=CC(=C(C=C1)C(=O)O)O). I looped through all groups defined in 
Joback method (using SMARTS), and used m.GetSubstructMatches to print 
out all atom positions. The result is summarized in the table.


We can see there are duplicated counts--coming from COOH group. As 
suggested by Hongbin, we can remove duplicated atoms by looking at 
their positions--in this case, ((9),), ((7,8,),), ((7,),), and ((8,),) 
are subsets of ((7,8,9)) from -COOH. Indeed we can get rid of these 
duplicates. However, I also noticed that Atom (3,) from =C< (ring) 
group is also a part of -OH (phenol) ((10,3),). If we apply the same 
algorithm to remove duplicates, the =C<(ring) group will be only 
counted twice instead of three times.


Greg, you mentioned as an alternative I can delete substructure using 
chemical reaction method. It would be greatly appreciated if you could 
show me (point me to) a simple example code, perhaps on a simple 
molecule? I find myself at a loss when browsing the manual. I would 
like to try also in that direction.


Thanks,
Chenyang


Inline image 1


On Mon, Mar 6, 2017 at 1:52 AM, Greg Landrum <greg.land...@gmail.com> wrote:


The solution that Hongbin proposes to the double-counting problem
is a good one. Just be sure to sort your substructure queries in
the right order so that the more complex ones come first.

Another thing you might think about is making your queries more
specific. For example, as you pointed out "[OH]" is very general
and matches parts of carboxylic acids and a number of other
functional groups. The RDKit has a set of fairly well tested
(though certainly not perfect) functional group definitions in
$RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
definition from there looks like this:
[O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]


-greg


On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 <yanyangh...@163.com> wrote:

Hi, Chenyang,
You don't need to delete the substructure from the molecule.
Just check whether the mapped atoms have been matched. For
example:

m = Chem.MolFromSmiles('CC(=O)O')
OH = Chem.MolFromSmarts('[OH]')
COOH = Chem.MolFromSmarts('C(O)=O')

m.GetSubstructMatches(OH)
>>((3,),)
m.GetSubstructMatches(COOH)
>>((1, 3, 2),)

Since atom "3" has been already matched, it should be ignored.
So you can create a "set" to record the matched atoms to avoid
repetitive count.


Hongbin Yang 杨弘宾

*From:* Chenyang Shi <cs3...@columbia.edu>
*Date:* 2017-03-06 14:04
*To:* Greg Landrum <greg.land...@gmail.com>
*CC:* RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
*Subject:* Re: [Rdkit-discuss] delete a substructure
Hi Greg,

Thanks for a prompt reply. I did try
"GetSubstructMatches()" and it returns correct numbers of
substructures for CH3COOH. The potential problem with this
approach is that if the molecule is getting complicated,
it will possibly generate duplicate numbers for certain
functional groups. For example, --OH (alcohol) group will
be likely also counted in --COOH. A safer way, in my mind,
is to remove the substructure that has been counted.

Greg, you mentioned "chemical reaction functionality", can
you show me a demo script with that using CH3COOH as an
example. I will definitely delve into the manual to learn
more. But reading your code will be a good start.

Thanks,
Chenyang


On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum <greg.land...@gmail.com>
wrote:

Hi Chenyang,

If you're really interested in counting the number of
times the substructure appears, you can do that much
quicker with `GetSubstructMatches()`:

In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
  

Re: [Rdkit-discuss] delete a substructure

2017-03-06 Thread Chenyang Shi
Hongbin and Greg,
Thank you both for kind suggestions. I will try both approaches and report
my progress later.
Best,
Chenyang

On Monday, March 6, 2017, Greg Landrum <greg.land...@gmail.com> wrote:

> The solution that Hongbin proposes to the double-counting problem is a
> good one. Just be sure to sort your substructure queries in the right order
> so that the more complex ones come first.
>
> Another thing you might think about is making your queries more specific.
> For example, as you pointed out "[OH]" is very general and matches parts of
> carboxylic acids and a number of other functional groups. The RDKit has a
> set of fairly well tested (though certainly not perfect) functional group
> definitions in $RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
> definition from there looks like this:
> [O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]
>
>
> -greg
>
>
> On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 <yanyangh...@163.com> wrote:
>
>> Hi, Chenyang,
>> You don't need to delete the substructure from the molecule. Just
>> check whether the mapped atoms have been matched. For example:
>>
>> m = Chem.MolFromSmiles('CC(=O)O')
>> OH = Chem.MolFromSmarts('[OH]')
>> COOH = Chem.MolFromSmarts('C(O)=O')
>>
>> m.GetSubstructMatches(OH)
>> >> ((3,),)
>> m.GetSubstructMatches(COOH)
>> >> ((1, 3, 2),)
>>
>> Since atom "3" has been already matched, it should be ignored.
>> So you can create a "set" to record the matched atoms to avoid
>> repetitive count.
>>
>> --
>> Hongbin Yang 杨弘宾
>>
>>
>> *From:* Chenyang Shi <cs3...@columbia.edu>
>> *Date:* 2017-03-06 14:04
>> *To:* Greg Landrum <greg.land...@gmail.com>
>> *CC:* RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
>> *Subject:* Re: [Rdkit-discuss] delete a substructure
>> Hi Greg,
>>
>> Thanks for a prompt reply. I did try "GetSubstructMatches()" and it
>> returns correct numbers of substructures for CH3COOH. The potential problem
>> with this approach is that if the molecule is getting complicated, it will
>> possibly generate duplicate numbers for certain functional groups. For
>> example, --OH (alcohol) group will be likely also counted in --COOH. A
>> safer way, in my mind, is to remove the substructure that has been counted.
>>
>> Greg, you mentioned "chemical reaction functionality", can you show me a
>> demo script with that using CH3COOH as an example. I will definitely delve
>> into the manual to learn more. But reading your code will be a good start.
>>
>> Thanks,
>> Chenyang
>>
>>
>>
>> On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum <greg.land...@gmail.com> wrote:
>>
>>> Hi Chenyang,
>>>
>>> If you're really interested in counting the number of times the
>>> substructure appears, you can do that much quicker with
>>> `GetSubstructMatches()`:
>>>
>>> In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
>>> In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
>>> Out[3]: 2
>>>
>>> Is that sufficient, or do you actually want to sequentially remove all
>>> of the groups in your list?
>>>
>>> If you actually want to remove them, you are probably better off using
>>> the chemical reaction functionality instead of DeleteSubstructs(), which
>>> recalculates the number of implicit Hs on atoms after each call.
>>>
>>> -greg
>>>
>>>
>>> On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi <cs3...@columbia.edu> wrote:
>>>
>>>> I am new to rdkit but I am already impressed by its vibrant community.
>>>> I have a question regarding deleting substructure. In the RDKIT
>>>> documentation, this is a snippet of code describing how to delete
>>>> substructure:
>>>>
>>>> >>>m = Chem.MolFromSmiles("CC(=O)O")
>>>> >>>patt = Chem.MolFromSmarts("C(=O)[OH]")
>>>> >>>rm = AllChem.DeleteSubstructs(m, patt)
>>>> >>>Chem.MolToSmiles(rm)
>>>> 'C'
>>>>
>>>> This block of code f

Re: [Rdkit-discuss] delete a substructure

2017-03-05 Thread Greg Landrum
The solution that Hongbin proposes to the double-counting problem is a good
one. Just be sure to sort your substructure queries in the right order so
that the more complex ones come first.
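
A minimal sketch of that bookkeeping in plain Python (no RDKit calls), in case it helps. The OH and COOH match tuples for acetic acid are the ones Hongbin reported; the CH3 entry and the function name count_groups are assumptions added for illustration, and "more complex" is approximated here simply by the number of atoms in a match.

```python
# Match tuples for acetic acid, CC(=O)O. OH and COOH are as reported in
# the thread; the CH3 entry is an assumption added for illustration.
matches_by_group = {
    "OH": ((3,),),
    "COOH": ((1, 3, 2),),
    "CH3": ((0,),),
}


def count_groups(matches_by_group):
    """Count groups, letting larger patterns claim their atoms first."""
    # Process groups with the largest matches first (assumed proxy for
    # "more complex" queries).
    ordered = sorted(
        matches_by_group.items(),
        key=lambda item: max((len(m) for m in item[1]), default=0),
        reverse=True,
    )
    claimed = set()  # atom indices already assigned to a group
    counts = {}
    for name, matches in ordered:
        counts[name] = 0
        for match in matches:
            # Accept a match only if none of its atoms are taken yet.
            if claimed.isdisjoint(match):
                claimed.update(match)
                counts[name] += 1
    return counts


counts = count_groups(matches_by_group)
print(counts)
```

Because COOH is processed first and claims atom 3, the generic OH match on the same atom is skipped rather than counted a second time.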

Another thing you might think about is making your queries more specific.
For example, as you pointed out "[OH]" is very general and matches parts of
carboxylic acids and a number of other functional groups. The RDKit has a
set of fairly well tested (though certainly not perfect) functional group
definitions in $RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
definition from there looks like this:
[O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]


-greg


On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 <yanyangh...@163.com> wrote:

> Hi, Chenyang,
> You don't need to delete the substructure from the molecule. Just
> check whether the mapped atoms have been matched. For example:
>
> m = Chem.MolFromSmiles('CC(=O)O')
> OH = Chem.MolFromSmarts('[OH]')
> COOH = Chem.MolFromSmarts('C(O)=O')
>
> m.GetSubstructMatches(OH)
> >> ((3,),)
> m.GetSubstructMatches(COOH)
> >> ((1, 3, 2),)
>
> Since atom "3" has been already matched, it should be ignored.
> So you can create a "set" to record the matched atoms to avoid repetitive
> count.
>
> --
> Hongbin Yang 杨弘宾
>
>
> *From:* Chenyang Shi <cs3...@columbia.edu>
> *Date:* 2017-03-06 14:04
> *To:* Greg Landrum <greg.land...@gmail.com>
> *CC:* RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] delete a substructure
> Hi Greg,
>
> Thanks for a prompt reply. I did try "GetSubstructMatches()" and it
> returns correct numbers of substructures for CH3COOH. The potential problem
> with this approach is that if the molecule is getting complicated, it will
> possibly generate duplicate numbers for certain functional groups. For
> example, --OH (alcohol) group will be likely also counted in --COOH. A
> safer way, in my mind, is to remove the substructure that has been counted.
>
> Greg, you mentioned "chemical reaction functionality", can you show me a
> demo script with that using CH3COOH as an example. I will definitely delve
> into the manual to learn more. But reading your code will be a good start.
>
> Thanks,
> Chenyang
>
>
>
> On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum <greg.land...@gmail.com>
> wrote:
>
>> Hi Chenyang,
>>
>> If you're really interested in counting the number of times the
>> substructure appears, you can do that much quicker with
>> `GetSubstructMatches()`:
>>
>> In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
>> In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
>> Out[3]: 2
>>
>> Is that sufficient, or do you actually want to sequentially remove all of
>> the groups in your list?
>>
>> If you actually want to remove them, you are probably better off using
>> the chemical reaction functionality instead of DeleteSubstructs(), which
>> recalculates the number of implicit Hs on atoms after each call.
>>
>> -greg
>>
>>
>> On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi <cs3...@columbia.edu> wrote:
>>
>>> I am new to rdkit but I am already impressed by its vibrant community. I
>>> have a question regarding deleting substructure. In the RDKIT
>>> documentation, this is a snippet of code describing how to delete
>>> substructure:
>>>
>>> >>>m = Chem.MolFromSmiles("CC(=O)O")
>>> >>>patt = Chem.MolFromSmarts("C(=O)[OH]")
>>> >>>rm = AllChem.DeleteSubstructs(m, patt)
>>> >>>Chem.MolToSmiles(rm)
>>> 'C'
>>>
>>> This block of code first loads a molecule CH3COOH using SMILES code,
>>> then defines a substructure COOH using SMARTS code which is to be deleted.
>>> After final line of code, the program outputs 'C', in SMILES form.
>>>
>>> I had wanted to develop a method for detecting number of groups in a
>>> molecule. In CH3COOH case, I can search number of --CH3 and --COOH group by
>>> using their respective SMARTS code with no problem. However, when molecule
>>> becomes more complicated, it is preferred to delete the substructure that
>>> has been searched before moving to next search using SMARTS code. Well, in
>>> current case, after searching -COOH group and deleting it, the leftover is
>>> 'C' which is essentially CH4 instead of --CH3. I cannot proceed with
>>> searching with SMARTS code for --CH3 ([CH3;A;X4!R]).
>>>
>>> Is there any way to work around this?
>>> Th

Re: [Rdkit-discuss] delete a substructure

2017-03-05 Thread 杨弘宾
Hi Chenyang,

You don't need to delete the substructure from the molecule. Just check
whether the mapped atoms have already been matched. For example:

from rdkit import Chem
m = Chem.MolFromSmiles('CC(=O)O')
OH = Chem.MolFromSmarts('[OH]')
COOH = Chem.MolFromSmarts('C(O)=O')
m.GetSubstructMatches(OH)
>> ((3,),)
m.GetSubstructMatches(COOH)
>> ((1, 3, 2),)

Atom 3 appears in both matches, so once it has been counted as part of one
group it should be ignored in later searches. You can keep a "set" of the
matched atom indices to avoid counting the same atom twice.
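
The bookkeeping described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name count_groups and the (name, matches) input layout are made up here, and the match tuples are typed in by hand from the acetic acid example rather than computed. The CH3 entry assumes that '[CH3;X4]' matches only atom 0 of 'CC(=O)O'.

```python
def count_groups(matches_by_group):
    """Count groups in priority order, skipping matches that reuse atoms.

    matches_by_group is a list of (group_name, matches) pairs, where
    matches is a tuple of atom-index tuples in the shape returned by
    GetSubstructMatches(). Groups listed first claim their atoms first.
    """
    used = set()   # atom indices already assigned to some group
    counts = {}
    for name, matches in matches_by_group:
        n = 0
        for match in matches:
            # Accept a match only if none of its atoms is already claimed.
            if used.isdisjoint(match):
                used.update(match)
                n += 1
        counts[name] = n
    return counts


# Hand-entered match tuples for acetic acid, 'CC(=O)O' (see above).
acetic_acid = [
    ("COOH", ((1, 3, 2),)),
    ("OH", ((3,),)),
    ("CH3", ((0,),)),   # assumed result for '[CH3;X4]'
]
print(count_groups(acetic_acid))  # {'COOH': 1, 'OH': 0, 'CH3': 1}
```

In practice each matches entry would come from m.GetSubstructMatches(patt). The order of the list matters: putting the larger group (COOH) before the smaller one (OH) is what keeps the acid oxygen from being counted as an alcohol.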


Hongbin Yang 杨弘宾


From: Chenyang Shi
Date: 2017-03-06 14:04
To: Greg Landrum
CC: RDKit Discuss
Subject: Re: [Rdkit-discuss] delete a substructure

Hi Greg,

Thanks for a prompt reply. I did try "GetSubstructMatches()" and it returns
correct numbers of substructures for CH3COOH. The potential problem with this
approach is that if the molecule is getting complicated, it will possibly
generate duplicate numbers for certain functional groups. For example, --OH
(alcohol) group will be likely also counted in --COOH. A safer way, in my mind,
is to remove the substructure that has been counted.

Greg, you mentioned "chemical reaction functionality", can you show me a demo
script with that using CH3COOH as an example. I will definitely delve into the
manual to learn more. But reading your code will be a good start.

Thanks,
Chenyang

On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum  wrote:

Hi Chenyang,

If you're really interested in counting the number of times the substructure
appears, you can do that much quicker with `GetSubstructMatches()`:

In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
Out[3]: 2

Is that sufficient, or do you actually want to sequentially remove all of the
groups in your list?

If you actually want to remove them, you are probably better off using the
chemical reaction functionality instead of DeleteSubstructs(), which
recalculates the number of implicit Hs on atoms after each call.

-greg

On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi  wrote:

I am new to rdkit but I am already impressed by its vibrant community. I have a
question regarding deleting substructure. In the RDKIT documentation, this is a
snippet of code describing how to delete substructure:

>>> m = Chem.MolFromSmiles("CC(=O)O")
>>> patt = Chem.MolFromSmarts("C(=O)[OH]")
>>> rm = AllChem.DeleteSubstructs(m, patt)
>>> Chem.MolToSmiles(rm)
'C'

This block of code first loads a molecule CH3COOH using SMILES code, then
defines a substructure COOH using SMARTS code which is to be deleted. After
final line of code, the program outputs 'C', in SMILES form.

I had wanted to develop a method for detecting number of groups in a molecule.
In CH3COOH case, I can search number of --CH3 and --COOH group by using their
respective SMARTS code with no problem. However, when molecule becomes more
complicated, it is preferred to delete the substructure that has been searched
before moving to next search using SMARTS code. Well, in current case, after
searching -COOH group and deleting it, the leftover is 'C' which is essentially
CH4 instead of --CH3. I cannot proceed with searching with SMARTS code for
--CH3 ([CH3;A;X4!R]).

Is there any way to work around this?
Thanks,
Chenyang
 

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss









Re: [Rdkit-discuss] delete a substructure

2017-03-05 Thread Chenyang Shi
Hi Greg,

Thanks for the prompt reply. I did try "GetSubstructMatches()" and it returns
the correct number of substructures for CH3COOH. The potential problem with
this approach is that, as molecules get more complicated, it may count the
same atoms under more than one functional group. For example, an --OH
(alcohol) pattern will likely also match the OH inside --COOH. A safer way,
in my mind, is to remove each substructure once it has been counted.

Greg, you mentioned the "chemical reaction functionality". Could you show me
a demo script that uses it, with CH3COOH as the example? I will definitely
delve into the manual to learn more, but reading your code would be a good
start.

Thanks,
Chenyang



On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum 
wrote:

> Hi Chenyang,
>
> If you're really interested in counting the number of times the
> substructure appears, you can do that much quicker with
> `GetSubstructMatches()`:
>
> In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
> In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
> Out[3]: 2
>
> Is that sufficient, or do you actually want to sequentially remove all of
> the groups in your list?
>
> If you actually want to remove them, you are probably better off using the
> chemical reaction functionality instead of DeleteSubstructs(), which
> recalculates the number of implicit Hs on atoms after each call.
>
> -greg
>
>
> On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi  wrote:
>
>> I am new to rdkit but I am already impressed by its vibrant community. I
>> have a question regarding deleting substructure. In the RDKIT
>> documentation, this is a snippet of code describing how to delete
>> substructure:
>>
>> >>>m = Chem.MolFromSmiles("CC(=O)O")
>> >>>patt = Chem.MolFromSmarts("C(=O)[OH]")
>> >>>rm = AllChem.DeleteSubstructs(m, patt)
>> >>>Chem.MolToSmiles(rm)
>> 'C'
>>
>> This block of code first loads a molecule CH3COOH using SMILES code, then
>> defines a substructure COOH using SMARTS code which is to be deleted. After
>> final line of code, the program outputs 'C', in SMILES form.
>>
>> I had wanted to develop a method for detecting number of groups in a
>> molecule. In CH3COOH case, I can search number of --CH3 and --COOH group by
>> using their respective SMARTS code with no problem. However, when molecule
>> becomes more complicated, it is preferred to delete the substructure that
>> has been searched before moving to next search using SMARTS code. Well, in
>> current case, after searching -COOH group and deleting it, the leftover is
>> 'C' which is essentially CH4 instead of --CH3. I cannot proceed with
>> searching with SMARTS code for --CH3 ([CH3;A;X4!R]).
>>
>> Is there any way to work around this?
>> Thanks,
>> Chenyang
>>
>>
>>
>> 
>>
>>
>


Re: [Rdkit-discuss] delete a substructure

2017-03-05 Thread Greg Landrum
Hi Chenyang,

If you're really interested in counting the number of times the
substructure appears, you can do that much quicker with
`GetSubstructMatches()`:

In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
Out[3]: 2

Is that sufficient, or do you actually want to sequentially remove all of
the groups in your list?

If you actually want to remove them, you are probably better off using the
chemical reaction functionality instead of DeleteSubstructs(), which
recalculates the number of implicit Hs on atoms after each call.

-greg


On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi  wrote:

> I am new to rdkit but I am already impressed by its vibrant community. I
> have a question regarding deleting substructure. In the RDKIT
> documentation, this is a snippet of code describing how to delete
> substructure:
>
> >>>m = Chem.MolFromSmiles("CC(=O)O")
> >>>patt = Chem.MolFromSmarts("C(=O)[OH]")
> >>>rm = AllChem.DeleteSubstructs(m, patt)
> >>>Chem.MolToSmiles(rm)
> 'C'
>
> This block of code first loads a molecule CH3COOH using SMILES code, then
> defines a substructure COOH using SMARTS code which is to be deleted. After
> final line of code, the program outputs 'C', in SMILES form.
>
> I had wanted to develop a method for detecting number of groups in a
> molecule. In CH3COOH case, I can search number of --CH3 and --COOH group by
> using their respective SMARTS code with no problem. However, when molecule
> becomes more complicated, it is preferred to delete the substructure that
> has been searched before moving to next search using SMARTS code. Well, in
> current case, after searching -COOH group and deleting it, the leftover is
> 'C' which is essentially CH4 instead of --CH3. I cannot proceed with
> searching with SMARTS code for --CH3 ([CH3;A;X4!R]).
>
> Is there any way to work around this?
> Thanks,
> Chenyang
>
>
>
> 
>
>


[Rdkit-discuss] delete a substructure

2017-03-05 Thread Chenyang Shi
Hi everyone,

I am new to RDKit but I am already impressed by its vibrant community. I
have a question about deleting a substructure. The RDKit documentation
gives this snippet showing how to delete a substructure:

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> m = Chem.MolFromSmiles("CC(=O)O")
>>> patt = Chem.MolFromSmarts("C(=O)[OH]")
>>> rm = AllChem.DeleteSubstructs(m, patt)
>>> Chem.MolToSmiles(rm)
'C'

This code first loads the molecule CH3COOH from its SMILES string, then
defines the substructure to be deleted, COOH, as a SMARTS pattern. The
final line outputs 'C', the SMILES of what remains.

I want to develop a method for counting the functional groups in a
molecule. For CH3COOH, I can count the --CH3 and --COOH groups with their
respective SMARTS patterns with no problem. However, as molecules become
more complicated, I would prefer to delete each substructure once it has
been found, before moving on to the next SMARTS search. In the current
case, after finding the --COOH group and deleting it, the leftover is 'C',
which is essentially CH4 rather than --CH3, so I cannot proceed with the
SMARTS search for --CH3 ([CH3;A;X4!R]).

Is there any way to work around this?
Thanks,
Chenyang