Re: [Rdkit-discuss] Information contained in SMARTS and SMILES
On Wed, Apr 19, 2017 at 7:25 PM, Andrew Dalke wrote:

> On Apr 19, 2017, at 23:59, Peter S. Shenkin wrote:
> > One more thing. The term "Mol" in RDKit and some other toolkits does not
> > really mean "molecule" in the sense that chemists use it.
>
> ? I don't see how this is connected to the previous emails.

The connection is that, based on the wording of the query, I thought that perhaps Thilo was expecting a SMARTS to specify a molecule as chemists understand the term.

-P.

--
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Information contained in SMARTS and SMILES
On Apr 19, 2017, at 23:59, Peter S. Shenkin wrote:
> One more thing. The term "Mol" in RDKit and some other toolkits does not
> really mean "molecule" in the sense that chemists use it.

? I don't see how this is connected to the previous emails.

I believe most toolkits use that terminology in their APIs (Daylight, OEChem, Open Babel, RDKit, Indigo, JChem, and InChI). I know that VMD does that too, and I believe PyMol and RasMol as well.

A minority of software uses other terms. CACTVS calls it a 'molecular ensemble'. CDK calls it an 'atom container' (though I see people assign it to variables with 'm' or 'mol' in the name).

I haven't really run into people who found this to be an issue, so I've stopped bringing it up in my documentation or when I teach. I mostly work with computational chemists, and that bias may affect things. But this current thread is a discussion between computational people, which is why I don't understand the relevance.

> The way I think of it is that SMILES is like an ordinary string and SMARTS is
> like a regex that can be used to flexibly match other strings.

I think this is a reasonable approximation for computer programmers. I modeled my PyDaylight wrapper on top of the Daylight toolkit using this view.

Then Greg and RDKit showed me that that view was narrower than need be. In RDKit, a molecule can also be used as a subgraph.

>>> from rdkit import Chem
>>> mol1 = Chem.MolFromSmiles("c1ccccc1")
>>> mol2 = Chem.MolFromSmiles("c1ccccc1O")
>>> mol2.HasSubstructMatch(mol1)
True
>>> mol1.HasSubstructMatch(mol2)
False

Stretching your analogy, this would be like a substring search rather than a regexp. It's a difficult stretch because substring search has different performance characteristics from regexp search, while subgraph search is NP-complete even when only a simple SMILES is used to define the subgraph.

Alternatively, it could be like using a constrained glob pattern language instead of a more flexible regular expression. Well, except that SMILES as a pattern language has no flexibility for conjunction, disjunction, or repetition.

Furthermore, in RDKit a SMARTS pattern can (to a limited extent) be used to match a SMARTS pattern:

>>> pat1 = Chem.MolFromSmarts("[#7]=[#6]-[#8]")
>>> pat2 = Chem.MolFromSmarts("[#7]=[#8]")
>>> pat1.HasSubstructMatch(pat2)
False
>>> pat3 = Chem.MolFromSmarts("[#6]=[#7]")
>>> pat1.HasSubstructMatch(pat3)
True

I've used this once in my work, when I generated simple subgraph fragments as SMARTS patterns and then matched the patterns against each other to generate a hierarchical tree.

This would correspond roughly to checking if one regular expression is a subset of another, which is a very different algorithm from pattern matching a string.

Andrew
da...@dalkescientific.com
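The molecule-as-query point above can be tried out with a minimal sketch, assuming a reasonably recent RDKit is installed. The helper name `contains` and the `c[O,N]` pattern are made up for this illustration and do not come from the thread. The sketch contrasts a SMILES-derived query (the "substring" end of the analogy) with a SMARTS query that uses a disjunction, which SMILES cannot express.

```python
from rdkit import Chem

benzene = Chem.MolFromSmiles("c1ccccc1")
phenol = Chem.MolFromSmiles("c1ccccc1O")
aniline = Chem.MolFromSmiles("c1ccccc1N")


def contains(target, query):
    """Return True if `query` occurs as a substructure of `target`.

    `query` may be a molecule parsed from SMILES or a pattern parsed
    from SMARTS; RDKit accepts either as the query argument.
    """
    return target.HasSubstructMatch(query)


# A molecule parsed from SMILES used directly as the query.
print(contains(phenol, benzene))   # benzene ring inside phenol
print(contains(benzene, phenol))   # phenol does not fit inside benzene

# A SMARTS query with a disjunction (hypothetical example pattern):
# an aromatic carbon bonded to an aliphatic oxygen OR nitrogen.
hetero_sub = Chem.MolFromSmarts("c[O,N]")
print([contains(m, hetero_sub) for m in (benzene, phenol, aniline)])
```

The same `HasSubstructMatch` call serves both cases; only the way the query object was built differs.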
Re: [Rdkit-discuss] Information contained in SMARTS and SMILES
One more thing. The term "Mol" in RDKit and some other toolkits does not really mean "molecule" in the sense that chemists use it. It is used to connote a data structure that can store a SMARTS or a SMILES. Only when a SMILES is used does it really correspond to a chemical "molecule", except, in some cases, by accident; and, as Andrew pointed out, there are cases when exactly the same string means different things in a SMARTS and a SMILES context.

The way I think of it is that SMILES is like an ordinary string and SMARTS is like a regex that can be used to flexibly match other strings.

-P.

On Wed, Apr 19, 2017 at 5:20 PM, Andrew Dalke wrote:

> On Apr 19, 2017, at 18:26, Curt Fischer wrote:
> > From chemistry stack exchange, an answer contributed by user R.M.:
> >
> > SMARTS is deliberately designed to be a superset of SMILES. That is, any
> > valid SMILES depiction should also be a valid SMARTS query, one that will
> > retrieve the very structure that the SMILES string depicts.
>
> Except, that last clause isn't true. Try matching tritium against itself.
>
> >>> from rdkit import Chem
> >>> mol = Chem.MolFromSmiles("[3H]")
> >>> pat = Chem.MolFromSmarts("[3H]")
> >>> mol.HasSubstructMatch(pat)
> False
>
> For hydrogens you must use '#1', because H in SMARTS means something
> different.
>
> >>> pat2 = Chem.MolFromSmarts("[3#1]")
> >>> mol.HasSubstructMatch(pat2)
> True
>
> SMILES input under Daylight and most other toolkits gets normalized to the
> chemistry model, including aromaticity perception:
>
> >>> mol = Chem.MolFromSmiles("C1=CC=CC=C1")
> >>> pat = Chem.MolFromSmarts("C1=CC=CC=C1")
> >>> mol.HasSubstructMatch(pat)
> False
> >>> pat2 = Chem.MolFromSmarts("c1ccccc1")
> >>> mol.HasSubstructMatch(pat2)
> True
>
> RDKit also does a small amount of additional normalization, or
> 'sanitization' to use the RDKit term. For example, it will convert "neutral
> 5 coordinate Ns with double bonds to Os to the zwitterionic form" (see
> GraphMol/MolOps.cpp):
>
> >>> s = "CN(=O)=O"
> >>> mol = Chem.MolFromSmiles(s)
> >>> pat = Chem.MolFromSmarts(s)
> >>> mol.HasSubstructMatch(pat)
> False
> >>> Chem.MolToSmiles(mol)
> 'C[N+](=O)[O-]'
>
> I believe that the output SMILES from a toolkit, assuming that the SMILES
> doesn't have an explicit hydrogen, can be used as a SMARTS which will match
> the molecule made from that same SMILES, by that same toolkit.
>
> This is a weaker statement than that made by user R.M.
>
> Andrew
> da...@dalkescientific.com
Re: [Rdkit-discuss] Information contained in SMARTS and SMILES
On Apr 19, 2017, at 18:26, Curt Fischer wrote:
> From chemistry stack exchange, an answer contributed by user R.M.:
>
> SMARTS is deliberately designed to be a superset of SMILES. That is, any
> valid SMILES depiction should also be a valid SMARTS query, one that will
> retrieve the very structure that the SMILES string depicts.

Except, that last clause isn't true. Try matching tritium against itself.

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("[3H]")
>>> pat = Chem.MolFromSmarts("[3H]")
>>> mol.HasSubstructMatch(pat)
False

For hydrogens you must use '#1', because H in SMARTS means something different.

>>> pat2 = Chem.MolFromSmarts("[3#1]")
>>> mol.HasSubstructMatch(pat2)
True

SMILES input under Daylight and most other toolkits gets normalized to the chemistry model, including aromaticity perception:

>>> mol = Chem.MolFromSmiles("C1=CC=CC=C1")
>>> pat = Chem.MolFromSmarts("C1=CC=CC=C1")
>>> mol.HasSubstructMatch(pat)
False
>>> pat2 = Chem.MolFromSmarts("c1ccccc1")
>>> mol.HasSubstructMatch(pat2)
True

RDKit also does a small amount of additional normalization, or 'sanitization' to use the RDKit term. For example, it will convert "neutral 5 coordinate Ns with double bonds to Os to the zwitterionic form" (see GraphMol/MolOps.cpp):

>>> s = "CN(=O)=O"
>>> mol = Chem.MolFromSmiles(s)
>>> pat = Chem.MolFromSmarts(s)
>>> mol.HasSubstructMatch(pat)
False
>>> Chem.MolToSmiles(mol)
'C[N+](=O)[O-]'

I believe that the output SMILES from a toolkit, assuming that the SMILES doesn't have an explicit hydrogen, can be used as a SMARTS which will match the molecule made from that same SMILES, by that same toolkit.

This is a weaker statement than that made by user R.M.

Andrew
da...@dalkescientific.com
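The difference between the strong claim (any input SMILES works as a SMARTS for its own molecule) and the weaker one (the toolkit's output SMILES does) can be checked side by side. This is a minimal sketch, assuming RDKit is installed; the two helper names are hypothetical, and the test strings deliberately avoid explicit hydrogens, in line with the caveat above.

```python
from rdkit import Chem


def input_smiles_matches(smiles):
    """Parse `smiles` as a molecule and, unchanged, as a SMARTS query."""
    mol = Chem.MolFromSmiles(smiles)
    pat = Chem.MolFromSmarts(smiles)
    return mol.HasSubstructMatch(pat)


def output_smiles_matches(smiles):
    """Use the toolkit's own output SMILES as the SMARTS query instead."""
    mol = Chem.MolFromSmiles(smiles)
    pat = Chem.MolFromSmarts(Chem.MolToSmiles(mol))
    return mol.HasSubstructMatch(pat)


# Ethanol, Kekule benzene, and nitromethane written with a 5-valent N.
for smi in ("CCO", "C1=CC=CC=C1", "CN(=O)=O"):
    print(smi, input_smiles_matches(smi), output_smiles_matches(smi))
```

For the Kekule benzene and the 5-valent nitro group the input string fails to match its own molecule, because the molecule has been normalized while the SMARTS has not. The output SMILES is written after normalization, so it matches in all three cases.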
Re: [Rdkit-discuss] Information contained in SMARTS and SMILES
On Apr 19, 2017, at 12:03, Thilo Bauer wrote:
> is converting SMARTS to SMILES a "lossless" operation, or does one lose
> information on doing so?

It is obviously not lossless if you include terms that cannot be represented in SMILES:

>>> from rdkit import Chem
>>> Chem.MolToSmiles(Chem.MolFromSmarts("[C,N]"))
'C'

or which don't make sense as a molecule:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("c"))
'c'
>>> Chem.MolFromSmiles("c")
[23:02:24] non-ring atom 0 marked aromatic

It also loses some information which could be represented in SMILES:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("[NH4+]"))
'N'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("C[N+]1(C)C1"))
'CN1(C)C1'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("[12C]"), isomericSmiles=True)
'C'

Do be careful if you want to handle aromatic atoms and bonds:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1"))
'C1:C:C:C:C:C:1'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("c=1-c=c-c=c-c=1"))
'c1=c-c=c-c=c-1'

> Background:
> I've got three different SMARTS strings representing the same structure
> - at least when depicting it. Also all three strings result in the exact
> same SMILES (see code and output below).

It looks like you want SMARTS canonicalization. In general this is hard, because SMARTS can include boolean expressions and recursive SMARTS.

If you limit yourself to patterns like '[#6]-1=[#6]-[#6]...', with only atomic numbers and single/double/triple bonds, then I think RDKit will do what you want.

Andrew
da...@dalkescientific.com
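One way to see why the `[C,N]` case cannot survive the conversion is to compare matching behaviour rather than output strings. This is a minimal sketch, assuming RDKit is installed; the variable names are made up for the illustration. No expected string is shown for the final `MolToSmiles` call, since the exact text may differ between RDKit versions.

```python
from rdkit import Chem

# SMARTS query: aliphatic carbon OR aliphatic nitrogen.
either = Chem.MolFromSmarts("[C,N]")
# The most a plain SMILES can say about a single atom: one element.
carbon_only = Chem.MolFromSmiles("C")

methane = Chem.MolFromSmiles("C")
ammonia = Chem.MolFromSmiles("N")

# The SMARTS query accepts both targets.
either_hits = [m.HasSubstructMatch(either) for m in (methane, ammonia)]
# The SMILES-derived query accepts only the carbon target.
carbon_hits = [m.HasSubstructMatch(carbon_only) for m in (methane, ammonia)]
print(either_hits, carbon_hits)

# Converting the query to SMILES has to settle on a single atom symbol,
# so whatever string comes back cannot carry the OR.
print(Chem.MolToSmiles(either))
```

Any single-element SMILES will miss one of the two targets that the SMARTS accepts, so the disjunction is lost whichever symbol the writer emits.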
Re: [Rdkit-discuss] Information contained in SMARTS and SMILES
Hi Thilo,

Interesting question. rdkit-discuss members should know you also posted a very similar question at
https://chemistry.stackexchange.com/questions/72880/is-converting-smarts-to-smiles-a-lossless-operation

If an interesting answer materializes here, it would be useful to post it there, and vice versa.

Curt

On Wed, Apr 19, 2017 at 3:03 AM, Thilo Bauer wrote:

> Dear mailinglist-members,
>
> is converting SMARTS to SMILES a "lossless" operation, or does one lose
> information on doing so?
>
> Background:
> I've got three different SMARTS strings representing the same structure
> - at least when depicting it. Also all three strings result in the exact
> same SMILES (see code and output below).
>
> Now, don't take this wrong, I do know the differences between SMARTS and
> SMILES, and I do know what the symbols in SMARTS mean. I just wonder,
> when I use either the three SMARTS or the single SMILES as a pattern
> for a substruct match, if there is a chance that I get different
> results, or let's say if I would miss substructure occurrences by using
> the single SMILES? I could not make up a case where this happened.
>
> >>> m = Chem.MolFromSmarts('[#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8]')
> >>> Chem.MolToSmiles(m)
> 'CC1CC=CC(=O)C1'
> >>> m = Chem.MolFromSmarts('[#6]-1-[#6]=[#6]-[#6](-[#6]-[#6]-1-[#6])=[#8]')
> >>> Chem.MolToSmiles(m)
> 'CC1CC=CC(=O)C1'
> >>> m = Chem.MolFromSmarts('[#6]-1-[#6](-[#6]=[#6]-[#6]-[#6]-1-[#6])=[#8]')
> >>> Chem.MolToSmiles(m)
> 'CC1CC=CC(=O)C1'
>
> Thanks a lot in advance!
>
> Thilo
[Rdkit-discuss] Information contained in SMARTS and SMILES
Dear mailinglist-members,

is converting SMARTS to SMILES a "lossless" operation, or does one lose information on doing so?

Background:
I've got three different SMARTS strings representing the same structure - at least when depicting it. Also all three strings result in the exact same SMILES (see code and output below).

Now, don't take this wrong, I do know the differences between SMARTS and SMILES, and I do know what the symbols in SMARTS mean. I just wonder, when I use either the three SMARTS or the single SMILES as a pattern for a substruct match, if there is a chance that I get different results, or let's say if I would miss substructure occurrences by using the single SMILES? I could not make up a case where this happened.

>>> m = Chem.MolFromSmarts('[#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8]')
>>> Chem.MolToSmiles(m)
'CC1CC=CC(=O)C1'
>>> m = Chem.MolFromSmarts('[#6]-1-[#6]=[#6]-[#6](-[#6]-[#6]-1-[#6])=[#8]')
>>> Chem.MolToSmiles(m)
'CC1CC=CC(=O)C1'
>>> m = Chem.MolFromSmarts('[#6]-1-[#6](-[#6]=[#6]-[#6]-[#6]-1-[#6])=[#8]')
>>> Chem.MolToSmiles(m)
'CC1CC=CC(=O)C1'

Thanks a lot in advance!

Thilo
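A question like this can also be probed empirically by running all four queries over a set of targets and looking for rows where they disagree. This is a minimal sketch, assuming RDKit is installed. The target list and the names `match_table` and `disagreements` are invented for the illustration. A handful of targets cannot prove equivalence; it can only fail to find a counterexample.

```python
from rdkit import Chem

smarts_variants = [
    "[#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8]",
    "[#6]-1-[#6]=[#6]-[#6](-[#6]-[#6]-1-[#6])=[#8]",
    "[#6]-1-[#6](-[#6]=[#6]-[#6]-[#6]-1-[#6])=[#8]",
]
smiles_query = "CC1CC=CC(=O)C1"

# Three SMARTS-derived queries plus the single SMILES-derived one.
queries = [Chem.MolFromSmarts(s) for s in smarts_variants]
queries.append(Chem.MolFromSmiles(smiles_query))

# Hypothetical test targets: the structure itself, close relatives, benzene.
targets = [
    "CC1CC=CC(=O)C1",    # the structure itself
    "O=C1CCCC=C1",       # no methyl group
    "CC1CC=CC(=O)C1O",   # extra hydroxyl on the ring
    "CC1CCCC(=O)C1",     # no ring double bond
    "c1ccccc1",          # benzene
]


def match_table(queries, target_smiles):
    """Map each target SMILES to the list of per-query match results."""
    table = {}
    for smi in target_smiles:
        mol = Chem.MolFromSmiles(smi)
        table[smi] = [mol.HasSubstructMatch(q) for q in queries]
    return table


table = match_table(queries, targets)
# Keep only the targets on which the four queries do not all agree.
disagreements = {smi: row for smi, row in table.items() if len(set(row)) > 1}
print(table)
print(disagreements)
```

These patterns use only atomic numbers and explicit single and double bonds, which is the restricted case where the two representations are expected to behave alike. Adding aromatic targets or SMARTS with unspecified bonds to the lists is a quick way to look for cases where they diverge.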