Re: [Rdkit-discuss] isotopic SMILES

2017-02-07 Thread Andrew Dalke
On Feb 7, 2017, at 22:26, Curt Fischer  wrote:
> def same_implicit_valence(mol_1, mol_2, atom_idx=1):
> """Returns True if mol_1 and mol_2 have the same implicit valence for the 
> indexed atom"""
> mol_1_implicitH = mol_1.GetAtomWithIdx(atom_idx).GetImplicitValence()
> mol_2_implicitH = mol_2.GetAtomWithIdx(atom_idx).GetImplicitValence()
> return mol_1_implicitH == mol_2_implicitH

They have have the same implicit valence but they have different numbers of 
explicit hydrogens.

>>> mol_1.GetAtomWithIdx(1).GetNumExplicitHs()
0
>>> mol_2.GetAtomWithIdx(1).GetNumExplicitHs()
2

They also have different InChI strings, as expected:

  InChI=1S/C2H4O/c1-2-3/h3H,1H3/i2+1
  InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3

RDKit and SMILES have different views on what "explicit" and "implicit" 
hydrogens mean.

In SMILES:
  [C]([H])([H])([H])[H] - 4 explicit hydrogens
  C - 4 implicit hydrogens, specified via valence (i.e., "implicitly specified")
  [CH4] - 4 implicit hydrogens, specified via notation (i.e., "explicitly 
specified")

I give the "i.e." terms to show how the term "implicit" can be used in two 
different ways. This is the long-term source of the ambiguity in what 
"implicit" and "explicit" mean.



In RDKit:

   C - valence of 4, no explicit hydrogens, therefore 4 implicit hydrogens
   needed to fill the valence:

>>> mol = Chem.MolFromSmiles("C")
>>> mol.GetAtomWithIdx(0).GetImplicitValence()
4
>>> mol.GetAtomWithIdx(0).GetNumExplicitHs()
0

   [CH4] - 4 explicit hydrogens, nothing else needed to fill the valence

>>> mol = Chem.MolFromSmiles("[CH4]")
>>> mol.GetAtomWithIdx(0).GetImplicitValence()
0
>>> mol.GetAtomWithIdx(0).GetNumExplicitHs()
4

[C]([H])([H])([H])[H] by default are turned into explicit hydrogens

>>> mol = Chem.MolFromSmiles("[C]([H])([H])([H])[H]")
>>> mol.GetNumAtoms()
1
>>> mol.GetAtomWithIdx(0).GetImplicitValence()
0
>>> mol.GetAtomWithIdx(0).GetNumExplicitHs()
4


The conversion is part of input sanitization. This can be disabled:
>>> mol = Chem.MolFromSmiles("[C]([H])([H])([H])[H]", sanitize=False)
>>> mol.GetNumAtoms()
5

BTW, your convert_to_smiles_via_inchi() will lose all atom ordering 
information. It works for your example because the structure is so simple and 
there are two carbons. However, consider:

>>> mol = Chem.MolFromSmiles("OCP")
>>> Chem.MolToInchi(mol)
'InChI=1S/CH5OP/c2-1-3/h2H,1,3H2'
>>> mol2 = Chem.MolFromInchi(Chem.MolToInchi(mol))
>>> [atom.GetAtomicNum() for atom in mol2.GetAtoms()]
[6, 8, 15]

As a SMILES this would be written

  C(O)P

where the atoms are re-ordered.

Andrew
da...@dalkescientific.com


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] isotopic SMILES

2017-02-07 Thread Curt Fischer
I think you've persuaded me that .SetIsotope() is the way to go...

I don't understand how that avoids any problem. How do you specify the
> target atom for that case?
> In any case, won't the InChI normalization affect some of your structures
> (e.g., detaching metals) and make it even harder to specify isotopes?


but just to clarify, here is the admittedly hackish via-InChI conversion I
had in mind.  The specification of isotopes still happens in smiles, which
I find easier for humans to grok than InChI.  {{Side note: the end
application here is modeling the labeling patterns in structurally complex
natural products that are biosynthesized from simple labeled substrates
(such as ethanol).  So I ultimately want to feed my labeled molecules to a
series of SMARTS reactions that will eventually lead to a complex
structure.  Given the vagaries of SMARTS reaction matching, I want to be
sure that all my reactions apply equally well to labeled and unlabeled
substrates.}}

def convert_to_smiles_via_inchi(smiles):
"""Make a molecule from SMILES but via Inchi"""
temp_mol = Chem.MolFromSmiles(smiles)
inchi = Chem.MolToInchi(temp_mol)
final_mol = Chem.MolFromInchi(inchi)
return final_mol

def same_implicit_valence(mol_1, mol_2, atom_idx=1):
"""Returns True if mol_1 and mol_2 have the same implicit valence for
the indexed atom"""
mol_1_implicitH = mol_1.GetAtomWithIdx(atom_idx).GetImplicitValence()
mol_2_implicitH = mol_2.GetAtomWithIdx(atom_idx).GetImplicitValence()
return mol_1_implicitH == mol_2_implicitH

etoh_v1 = 'C[13C]O'
etoh_v2 = 'CCO'

etoh_versions = [etoh_v1, etoh_v2]

via_inchi = [convert_to_smiles_via_inchi(mol) for mol in etoh_versions]
smiles_only = [Chem.MolFromSmiles(mol) for mol in etoh_versions]

# this works
assert same_implicit_valence(*via_inchi)

# this doesn't
assert same_implicit_valence(*smiles_only)
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] isotopic SMILES

2017-02-07 Thread Andrew Dalke
On Feb 7, 2017, at 19:02, Curt Fischer  wrote:
> My ultimate goal is an easy way to create rdkit molecules that have isotopic 
> substitutions but which are otherwise exactly the same as non-substituted 
> variants.  What's the best approach?  Is it to directly call .SetIsotope() 
> like I do above?

Yes.

There are other solutions. For example, you can ask RDKit to generate the 
SMILES with all explicit hydrogens.

>>> Chem.MolToSmiles(mol, allHsExplicit=True)
'[CH3][CH2][OH]'

It's then trivial to manipulate the SMILES syntax at the string level to set an 
isotope.


>  This requires figuring out the rdkit atom index of my target atom, which is 
> doable but perhaps (?) overly complicated?

It comes down to how you want to specify the target atom.

If it's the position of the atom in the SMILES string, then it's (almost) 
trivial - the atom index is the index of the term in the SMILES string, minus 
any '[H]' terms.

If you specify the target atom with a SMARTS then use RDKit's SMARTS matching 
to get the atom indices.


>   Converting to InChI and back seems to avoid the problem, I guess because 
> MolToInchi() has a removeHs parameter that defaults to True.  That also seems 
> a bit hack-ish but I'm not sure what the best approach is.

I don't understand how that avoids any problem. How do you specify the target 
atom for that case?

In any case, won't the InChI normalization affect some of your structures 
(e.g., detaching metals) and make it even harder to specify isotopes?



Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] isotopic SMILES

2017-02-07 Thread Curt Fischer
I replied to Andrew's very nice discussion of implicit hydrogens in SMILES
 but forgot to include the whole list.

Wow, thank you, that was very useful.  I didn't realize those nuances of
> SMILES.
>
> On the rdkit "side", the distinction made in Smiles between implicit and
> explicit hydrogens seems to live in the atom properties ImplicitValence and
> NumImplicitHs.  rdkit, unlike SMILES apparently, does not require that
> isotopically labeled atoms have an implicit valence explicitly specified.
>
> # print out some key properties of all atoms in a molecule
> def print_atom_info(mol):
> for atom in mol.GetAtoms():
> print(atom.GetIdx(), atom.GetSymbol(),
>   atom.GetIsotope(), atom.GetImplicitValence(),
>   atom.GetNumImplicitHs())
> print('\n')
> return
>
> # make some ethanol
> ethanol_v1 = Chem.MolFromSmiles('CCO')
>
> # label carbon #1 with C13
> ethanol_v1.GetAtomWithIdx(1).SetIsotope(13)
>
> # make more ethanol, labeled as it's created
> ethanol_v2 = Chem.MolFromSmiles('C[13C]O')
>
> # implicit valence is different for the two molecules
> # so is implicit
> show_atom_info(ethanol_v1)
> show_atom_info(ethanol_v2)
>
> Draw.MolsToGridImage([ethanol_v1, ethanol_v2])
>

I also didn't finish my thought here.  My ultimate goal is an easy way to
create rdkit molecules that have isotopic substitutions but which are
otherwise exactly the same as non-substituted variants.  What's the best
approach?  Is it to directly call .SetIsotope() like I do above?  This
requires figuring out the rdkit atom index of my target atom, which is
doable but perhaps (?) overly complicated?   Converting to InChI and back
seems to avoid the problem, I guess because MolToInchi() has a removeHs
parameter that defaults to True.  That also seems a bit hack-ish but I'm
not sure what the best approach is.

Curt
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] isotopic SMILES

2017-02-06 Thread Andrew Dalke
On Feb 7, 2017, at 01:17, Curt Fischer  wrote:
> I am confused by this behavior:
> 
> >>> labeled_etoh = Chem.MolFromSmiles('C[13C]O')
> >>> print(Chem.MolToSmiles(labeled_etoh))
> 
> C[C]O
> 
> >>> print(Chem.MolToSmiles(labeled_etoh, isomericSmiles=True))
> 
> C[13C]O
> 
> 1. Why are there any brackets at all in the first output?  Why not just 'CCO'?

The middle atom in "CCO" has two hydrogens. The middle atom in "C[C]O" has no 
hydrogens.

> 2. Is there any documentation anywhere that the "isomericSmiles" argument is 
> also an "isotopicSmiles" argument?

I don't believe so. A search via DuckDuckGo of rdkit.org finds only two 
irrelevant matches.

> I am also confused about when Chem.MolToSmiles() puts in H atoms in the 
> output.

SMILES has a short-hand notation to represent hydrogens. "[CH4]" and "C" are 
both methane.

When atom is described using brackets then the number of hydrogens must be 
specified with the H notation.

When an atom is described without brackets then the number of hydrogens is 
based on the permitted valence values. C has a valence of 4, -C- has two single 
bonds, so the middle carbon of CCO has two hydrogen bonds to complete the 
valence.

The output mechanism prefers to use the short-hand notation if possible. That 
isn't possible if the sum of hydrogens and bond types is different than one of 
the valence levels, or if there is an isotope, charge, chiral, etc., which 
requires the use of []s.

> 
> >>> three_hb1 = Chem.MolFromSmiles('C[13CH](O)C[13C](=O)O')
> >>> three_hb2 = Chem.MolFromSmiles('C[13C](O)C[13C](=O)O')
> >>> print(Chem.MolToSmiles(three_hb1, isomericSmiles=True))
> 
> C[13CH](O)C[13C](=O)O
> 
> >>> print(Chem.MolToSmiles(three_hb2, isomericSmiles=True))
> 
> C[13C](O)C[13C](=O)O
> 
> >>> print(Chem.MolToSmiles(three_hb1, isomericSmiles=False))
> 
> CC(O)CC(=O)O
> 
> >>> print(Chem.MolToSmiles(three_hb2, isomericSmiles=False))
> 
> C[C](O)CC(=O)O
> 
> 3. Why are there no brackets for three_hb1 output, but there are for 
> three_hb2?


I think you mean "for the isomericSmiles=False" output? The first three_hb1 
output has brackets.

The isotope notation requires []s, so the option of using the short-hand 
notation doesn't exist. In that case the number of hydrogens must be specified 
as otherwise it means the atom has no hydrogens.


> 4. As far as I can tell, the two three_hb molecules are identical.   Why 
> aren't all Hs removed during canonicalization?

The second atom in three_hb1 has 1 hydrogen and three single bonds.

The second atom in three_hb2 has 0 hydrogens and three single bonds.

They are different structures so have different SMILES.

Cheers,

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] isotopic SMILES

2017-02-06 Thread Curt Fischer
Hellow rdkit users,

What behavior should we expect for Chem.MolToSmiles() when dealing with
isotopically substituted molecules?


I am confused by this behavior:

>>> labeled_etoh = Chem.MolFromSmiles('C[13C]O')
>>> print(Chem.MolToSmiles(labeled_etoh))

C[C]O


>>> print(Chem.MolToSmiles(labeled_etoh, isomericSmiles=True))

C[13C]O


1. Why are there any brackets at all in the first output?  Why not just 'CCO
'?
2. Is there any documentation anywhere that the "isomericSmiles" argument
is also an "isotopicSmiles" argument?

I am also confused about when Chem.MolToSmiles() puts in H atoms in the
output.

>>> three_hb1 = Chem.MolFromSmiles('C[13CH](O)C[13C](=O)O')
>>> three_hb2 = Chem.MolFromSmiles('C[13C](O)C[13C](=O)O')
>>> print(Chem.MolToSmiles(three_hb1, isomericSmiles=True))


C[13CH](O)C[13C](=O)O


>>> print(Chem.MolToSmiles(three_hb2, isomericSmiles=True))

C[13C](O)C[13C](=O)O


>>> print(Chem.MolToSmiles(three_hb1, isomericSmiles=False))

CC(O)CC(=O)O


>>> print(Chem.MolToSmiles(three_hb2, isomericSmiles=False))


C[C](O)CC(=O)O


3. Why are there no brackets for three_hb1 output, but there are for
three_hb2?
4. As far as I can tell, the two three_hb molecules are identical.   Why
aren't all Hs removed during canonicalization?

Curt
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss