Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Peter S. Shenkin
On Wed, Apr 19, 2017 at 7:25 PM, Andrew Dalke 
wrote:

> On Apr 19, 2017, at 23:59, Peter S. Shenkin  wrote:
> > One more thing. The term "Mol" in RDKit and some other toolkits does not
> really mean "molecule" in the sense that chemists use it.
>
> ? I don't see how this is connected to the previous emails.
>

The connection is that, based on the wording of the query, I thought that
perhaps Thilo was expecting a SMARTS to specify a molecule as chemists
understand the term.

-P.
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Andrew Dalke
On Apr 19, 2017, at 23:59, Peter S. Shenkin  wrote:
> One more thing. The term "Mol" in RDKit and some other toolkits does not 
> really mean "molecule" in the sense that chemists use it.

? I don't see how this is connected to the previous emails.

I believe most toolkits use that terminology in their APIs (Daylight, OEChem, 
Open Babel, RDKit, Indigo, JChem, and InChI).

I know that VMD does that too, and I believe PyMol and RasMol as well.

A minority of software packages use other terms: CACTVS calls it a 
'molecular ensemble' and CDK an 'atom container' (though I see people assign it 
to variables with 'm' or 'mol' in the name).

I haven't really run into people who found this to be an issue, so I've stopped 
bringing it up in my documentation or when I teach. I mostly work with 
computational chemists, and that bias may affect things.

But this current thread is a discussion between computational people, which is 
why I don't understand the relevance.


> The way I think of it is that SMILES is like an ordinary string and SMARTS is 
> like a regex that can be used to flexibly match other strings.

I think this is a reasonable approximation for computer programmers. I modeled 
my PyDaylight wrapper on top of the Daylight toolkit using this view.

Then Greg and RDKit showed me that that view was narrower than need be. In 
RDKit, a molecule can also be used as a subgraph.

>>> from rdkit import Chem
>>> mol1 = Chem.MolFromSmiles("c1ccccc1")
>>> mol2 = Chem.MolFromSmiles("c1ccccc1O")
>>> mol2.HasSubstructMatch(mol1)
True
>>> mol1.HasSubstructMatch(mol2)
False

Stretching your analogy, this would be like a substring search rather than a 
regexp.

It's a difficult stretch because substring search has different performance 
characteristics to regexp search, while subgraph search is NP-complete even 
when only a simple SMILES is used to define the subgraph.
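
To make the cost concrete, here is a minimal brute-force sketch of labelled subgraph matching in plain Python. It is not how RDKit does it. The graph encoding (a list of element symbols plus a set of bonded index pairs) and the names `has_subgraph` and `ring_bonds` are made up for illustration, and bond orders, aromaticity and hydrogens are ignored. The backtracking search can take exponential time in the worst case.

```python
def has_subgraph(pattern_atoms, pattern_bonds, target_atoms, target_bonds):
    """Return True if the pattern graph occurs inside the target graph.

    Atoms are lists of element symbols; bonds are sets of frozenset({i, j}).
    Bond orders, aromaticity and hydrogens are ignored (illustration only).
    """
    n = len(pattern_atoms)

    def extend(mapping):
        # mapping[j] is the target atom index chosen for pattern atom j.
        k = len(mapping)  # index of the next pattern atom to place
        if k == n:
            return True
        for t, symbol in enumerate(target_atoms):
            if t in mapping or symbol != pattern_atoms[k]:
                continue
            # Every pattern bond from atom k back to an already placed
            # atom must also exist between the mapped target atoms.
            if all(frozenset((mapping[j], t)) in target_bonds
                   for j in range(k)
                   if frozenset((j, k)) in pattern_bonds):
                if extend(mapping + [t]):
                    return True
        return False

    return extend([])


def ring_bonds(size):
    """Bonds of a simple ring over atom indices 0..size-1."""
    return {frozenset((i, (i + 1) % size)) for i in range(size)}


# Heavy-atom skeletons only: a six-ring, and a six-ring with an O attached.
benzene = (["C"] * 6, ring_bonds(6))
phenol = (["C"] * 6 + ["O"], ring_bonds(6) | {frozenset((5, 6))})

print(has_subgraph(*benzene, *phenol))  # is the ring inside the ring-plus-O?
print(has_subgraph(*phenol, *benzene))  # and the other way round?
```

Real toolkits prune this search heavily, but the worst case remains expensive, which is the point of the comparison with substring search.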

Alternatively, it could be like using a constrained glob pattern language 
instead of a more flexible regular expression. Well, except that SMILES as a 
pattern language has no flexibility for conjunction, disjunction, or repetition.
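
The three levels of the string analogy can be put side by side with the Python standard library. This is only a sketch of the analogy: the text below is treated purely as a string, and string matching on SMILES is not chemical matching.

```python
import fnmatch
import re

text = "c1ccccc1O"  # treated purely as a string here, not as chemistry

# Substring search: the pattern has no flexibility at all.
print("ccc" in text)

# Glob: wildcards, but no alternation or counted repetition.
print(fnmatch.fnmatchcase(text, "c1*O"))

# Regular expression: alternation and repetition are available.
print(re.fullmatch(r"c1c{5}1(O|N)", text) is not None)
```

In this picture a SMILES used as a query sits at the substring end, and SMARTS at the regular expression end.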


Furthermore, in RDKit a SMARTS pattern can (to a limited extent) be used to 
match a SMARTS pattern:

>>> pat1 = Chem.MolFromSmarts("[#7]=[#6]-[#8]")
>>> pat2 = Chem.MolFromSmarts("[#7]=[#8]")
>>> pat1.HasSubstructMatch(pat2)
False
>>> pat3 = Chem.MolFromSmarts("[#6]=[#7]")
>>> pat1.HasSubstructMatch(pat3)
True

I've used this once in my work when I generated simple subgraph fragments as 
SMARTS patterns then used the patterns against themselves to generate a 
hierarchical tree.

This would correspond roughly to checking if one regular expression is a subset 
of another, which is a very different algorithm than pattern matching a string.
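
A rough sketch of how such a hierarchy could be assembled from pairwise "pattern A is contained in pattern B" tests follows. It is not the code from that project. The function name `build_hierarchy` and the `contains` callback are hypothetical, and the demo uses plain strings with substring search as a stand-in for pattern-against-pattern matching.

```python
def build_hierarchy(items, contains):
    """Map each item to its direct parents.

    contains(big, small) must return True when `small` occurs in `big`.
    A parent of X is a strictly smaller item contained in X, with no
    other such item in between.
    """
    def below(small, big):
        return (small != big
                and contains(big, small)
                and not contains(small, big))

    parents = {}
    for x in items:
        candidates = [y for y in items if below(y, x)]
        # Keep only the candidates that are not below another candidate.
        parents[x] = [y for y in candidates
                      if not any(below(y, z) for z in candidates)]
    return parents


# Stand-in demo: fragments are plain strings and "contains" is substring
# search. With RDKit one might instead key on SMARTS strings and have
# `contains` call HasSubstructMatch on the parsed patterns (an assumption,
# not tested here).
fragments = ["C", "CC", "CO", "CCO", "CCCO"]
tree = build_hierarchy(fragments, lambda big, small: small in big)
for fragment, direct_parents in tree.items():
    print(fragment, "<-", direct_parents)
```

The sketch makes a quadratic number of containment tests per item, which is fine for a few hundred fragments but would need pruning for larger sets.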



Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Peter S. Shenkin
One more thing. The term "Mol" in RDKit and some other toolkits does not
really mean "molecule" in the sense that chemists use it. It is used to
connote a data structure that can store a SMARTS or a SMILES. Only when a
SMILES is used does it really correspond to a chemical "molecule", except,
in some cases, by accident; and, as Andrew pointed out, there are cases
when exactly the same string means different things in a SMARTS and SMILES
context.

The way I think of it is that SMILES is like an ordinary string and SMARTS
is like a regex that can be used to flexibly match other strings.

-P.





Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Andrew Dalke
On Apr 19, 2017, at 18:26, Curt Fischer  wrote:
> From chemistry stack exchange, an answer contributed by user R.M.:
> 
> SMARTS is deliberately designed to be a superset of SMILES. That is, any 
> valid SMILES depiction should also be a valid SMARTS query, one that will 
> retrieve the very structure that the SMILES string depicts.

Except, that last clause isn't true. Try matching tritium against itself.

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("[3H]")
>>> pat = Chem.MolFromSmarts("[3H]")
>>> mol.HasSubstructMatch(pat)
False

For hydrogens you must use '#1', because H in SMARTS means something different.

>>> pat2 = Chem.MolFromSmarts("[3#1]")
>>> mol.HasSubstructMatch(pat2)
True

SMILES input under Daylight and most other toolkits gets normalized to the 
chemistry model, including aromaticity perception:

>>> mol = Chem.MolFromSmiles("C1=CC=CC=C1")
>>> pat = Chem.MolFromSmarts("C1=CC=CC=C1")
>>> mol.HasSubstructMatch(pat)
False
>>> pat2 = Chem.MolFromSmarts("c1ccccc1")
>>> mol.HasSubstructMatch(pat2)
True

RDKit also does a small amount of additional normalization, or 'sanitization' 
to use the RDKit term. For example, it will convert "neutral 5 coordinate Ns 
with double bonds to Os to the zwitterionic form" (see GraphMol/MolOps.cpp):

>>> s = "CN(=O)=O"
>>> mol = Chem.MolFromSmiles(s)
>>> pat = Chem.MolFromSmarts(s)
>>> mol.HasSubstructMatch(pat)
False
>>> Chem.MolToSmiles(mol)
'C[N+](=O)[O-]'

I believe that the output SMILES from a toolkit, assuming that the SMILES 
doesn't have an explicit hydrogen, can be used as a SMARTS which will match the 
molecule made from that same SMILES, by that same toolkit.

This is a weaker statement than that made by user R.M.
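
The weaker statement can be spot-checked with a small helper. This is a sketch only: the function name `output_smiles_matches_itself` is made up, the import guard is there just so the snippet still loads where RDKit is not installed, and a handful of passing inputs does not prove the claim in general.

```python
try:
    from rdkit import Chem
except ImportError:
    Chem = None  # RDKit not installed; the check below then returns None


def output_smiles_matches_itself(smiles):
    """Parse a SMILES, write it back out, and reuse the output as SMARTS."""
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)   # parsed and sanitized molecule
    out = Chem.MolToSmiles(mol)        # the toolkit's own output SMILES
    pat = Chem.MolFromSmarts(out)      # the same string, read as a query
    return mol.HasSubstructMatch(pat)


# Inputs without explicit hydrogens, including the two cases that failed
# above when the input SMILES itself was used as the SMARTS.
for smi in ["C1=CC=CC=C1", "CN(=O)=O", "CCO"]:
    print(smi, output_smiles_matches_itself(smi))
```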

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Andrew Dalke
On Apr 19, 2017, at 12:03, Thilo Bauer  wrote:
> is converting SMARTS to SMILES a "lossless" operation, or does one lose 
> information on doing so?


It is obviously not lossless if you include terms that cannot be represented in 
SMILES:

>>> from rdkit import Chem
>>> Chem.MolToSmiles(Chem.MolFromSmarts("[C,N]"))
'C'

or which don't make sense as a molecule:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("c"))
'c'
>>> Chem.MolFromSmiles("c")
[23:02:24] non-ring atom 0 marked aromatic


It also loses some information which could be represented in SMILES:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("[NH4+]"))
'N'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("C[N+]1(C)C1"))
'CN1(C)C1'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("[12C]"), isomericSmiles=True)
'C'

Do be careful if you want to handle aromatic atoms and bonds:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1"))
'C1:C:C:C:C:C:1'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("c=1-c=c-c=c-c=1"))
'c1=c-c=c-c=c-1'


> Background:
> I've got three different SMARTS strings representing the same structure 
> - at least when depicting it. Also all three strings result in the exact 
> same SMILES (see code and output below).

It looks like you want SMARTS canonicalization.

In general this is hard, because SMARTS can include boolean expressions and 
recursive SMARTS.

If you limit yourself to patterns like '[#6]-1=[#6]-[#6]...', with only atomic 
numbers and single/double/triple bonds, then I think RDKit will do what you 
want.
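
Before relying on that, one could check that a pattern really stays inside the restricted dialect. The sketch below is a purely lexical filter using the standard library. The name `is_simple_smarts` and the exact token set are assumptions made for illustration, and the check does not verify that rings close or that parentheses balance.

```python
import re

# Tokens allowed in the restricted dialect: [#n] atoms, the bonds - = #,
# ring-closure digits (or %nn) and branch parentheses.
_SIMPLE_SMARTS = re.compile(r"(?:\[#\d+\]|[-=#]|%\d\d|\d|[()])+")


def is_simple_smarts(smarts):
    """True if the string uses only the restricted token set.

    This is a lexical check only; it says nothing about whether the
    pattern is syntactically or chemically valid.
    """
    return _SIMPLE_SMARTS.fullmatch(smarts) is not None


simple = [
    "[#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8]",
    "[#6]-1-[#6]=[#6]-[#6](-[#6]-[#6]-1-[#6])=[#8]",
    "[#6]-1-[#6](-[#6]=[#6]-[#6]-[#6]-1-[#6])=[#8]",
]
not_simple = ["[C,N]", "c1ccccc1", "[$(C=O)]", "[#6]:1:[#6]:[#6]:1"]

for smarts in simple + not_simple:
    print(is_simple_smarts(smarts), smarts)
```

Patterns that pass the filter can then be sent through MolFromSmarts and MolToSmiles as in the original question. Anything with boolean expressions, recursive SMARTS, or aromatic symbols is rejected up front.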




Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Curt Fischer
Hi Thilo,

Interesting question.  rdkit-discuss members should know you also posted a
very similar question to
https://chemistry.stackexchange.com/questions/72880/is-converting-smarts-to-smiles-a-lossless-operation
.

If an interesting answer materializes here, it would be useful to post it
there, and vice-versa.

Curt



[Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Thilo Bauer
Dear mailinglist-members,

is converting SMARTS to SMILES a "lossless" operation, or does one lose 
information on doing so?

Background:
I've got three different SMARTS strings representing the same structure 
- at least when depicting it. Also all three strings result in the exact 
same SMILES (see code and output below).

Now, don't get me wrong, I do know the differences between SMARTS and 
SMILES, and I do know what the symbols in SMARTS mean. I just wonder, 
when I use either the three SMARTS or the single SMILES as a pattern 
for a substruct match, if there is a chance that I get different 
results, or let's say if I would miss substructure occurrences by using 
the single SMILES? I could not make up a case where this happened.


 >>> m = Chem.MolFromSmarts('[#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8]')
 >>> Chem.MolToSmiles(m)
'CC1CC=CC(=O)C1'
 >>> m = Chem.MolFromSmarts('[#6]-1-[#6]=[#6]-[#6](-[#6]-[#6]-1-[#6])=[#8]')
 >>> Chem.MolToSmiles(m)
'CC1CC=CC(=O)C1'
 >>> m = Chem.MolFromSmarts('[#6]-1-[#6](-[#6]=[#6]-[#6]-[#6]-1-[#6])=[#8]')
 >>> Chem.MolToSmiles(m)
'CC1CC=CC(=O)C1'


Thanks a lot in advance!

Thilo




