Re: [Rdkit-discuss] elimination of small fragments

Andrew Dalke Fri, 29 Jun 2018 03:54:01 -0700

On Jun 29, 2018, at 02:43, 藤秀義 <[email protected]> wrote:
> Although not strictly based on the number of atoms, but on the length of 
> SMILES string, the simplest way is using Python built-in functions as follows:
> 
> smiles = 'CCC.CC'
> fragment = max(smiles.split('.'), key=len)
> print (fragment)

The mmpdb package I helped develop includes some functions which work directly 
on the SMILES string. One of them uses a regular expression to count the number 
of heavy atoms, assuming the SMILES is valid and is written in 'normal' form, 
where the '.' is only used to distinguish between disconnected atoms (that is, 
things like "C1.C1" are not supported).

>>> smiles = "CCC.BrCl.[238U]"
>>> fragment = max(smiles.split('.'), key=len)
>>> fragment
'[238U]'

>>> from mmpdblib.fragment_algorithm import get_num_heavies_from_smiles
>>> fragment = max(smiles.split('.'), key=get_num_heavies_from_smiles)
>>> fragment
'CCC'
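
For readers without mmpdb at hand, the idea of counting heavy atoms with a
regular expression can be sketched as follows. This is only an illustration,
not mmpdb's implementation: the patterns and the name count_heavy_atoms are
assumptions made for this example. It assumes a valid SMILES, treats every
hydrogen isotope as non-heavy, and ignores the '*' wildcard.

```python
import re

# One atom token: a bracket atom, or an atom from the organic subset.
# Two-letter symbols (Cl, Br) are listed before their one-letter prefixes.
_ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")

# Inside a bracket atom: optional isotope digits, then the element symbol
# (one uppercase letter plus optional lowercase, or an aromatic symbol).
_BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms in a SMILES string (hypothetical helper)."""
    n = 0
    for token in _ATOM.findall(smiles):
        if token.startswith("["):
            m = _BRACKET_SYMBOL.match(token)
            # Skip hydrogens ([H], [2H], [H+]) and anything unrecognised.
            if m is None or m.group(1) == "H":
                continue
        n += 1
    return n


smiles = "CCC.BrCl.[238U]"
largest = max(smiles.split("."), key=count_heavy_atoms)
```

With this key function, 'largest' is 'CCC' rather than '[238U]', because the
isotope and brackets no longer inflate the score.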

I keep meaning to put together a package with the various SMILES tricks that 
are possible without full chemistry perception.

On Jun 29, 2018, at 10:56, Ed Griffen <[email protected]> wrote:
> Using the string length to find the number of atoms in a molecule is OK - but 
> you need to take account of the additional characters in SMILES that are not 
> just atoms,
   ...
> Here’s a worked example:
> 
> >>> SMILES = 'C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C'
> >>> print(len(SMILES))
> 34
> >>> heavies = [char for char in SMILES if char not in 
> ...   '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''']
> >>> print(len(heavies))
> 13

That's neat! But it doesn't always give the correct count. It fails for some 
two-letter element symbols:

>>> def count(smiles):
...   return sum(1 for c in smiles if c not in
...     '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''')
...
>>> count("[Hg]")
0
>>> count("[Zn]")
2
>>> count("[Tc]")
2
>>> count("[As]")
2

It also miscounts aromatic boron, as in:

>>> count("Cc1b[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1")
16
>>> count("Cc1B[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1")
17

Aromatic tellurium, [te], is missed as well, because both 't' and 'e' are in 
the excluded set.

These came up in a cross-comparison I did using ChEMBL as a test set. I 
excluded records with [2H] and [3H] because RDKit considers those to be 
heavy atoms while Ed's method does not.

The error rate is impressively low for such a simple approach, with only 467 
mismatches out of 1,726,695 cases.
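
A comparison of this kind can be sketched in a few lines. The real test used
RDKit as the reference over ChEMBL; here a small hand-labelled dictionary
stands in for both, and the names NON_ATOM_CHARS, char_filter_count,
REFERENCE, and find_mismatches are assumptions made for this example.

```python
# Characters that the character-filter approach treats as "not an atom".
NON_ATOM_CHARS = set(r"()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@")


def char_filter_count(smiles):
    """Count characters that are not in the excluded set."""
    return sum(1 for c in smiles if c not in NON_ATOM_CHARS)


# Hand-labelled heavy-atom counts (a stand-in for a toolkit-based reference).
REFERENCE = {
    "C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C": 13,
    "[Hg]": 1,
    "[Zn]": 1,
    "[Tc]": 1,
    "[As]": 1,
    "Cc1b[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1": 17,
    "c1cc[te]c1": 5,
}


def find_mismatches(reference, counter):
    """Return (smiles, expected, got) for every record the counter gets wrong."""
    mismatches = []
    for smiles, expected in reference.items():
        got = counter(smiles)
        if got != expected:
            mismatches.append((smiles, expected, got))
    return mismatches


mismatches = find_mismatches(REFERENCE, char_filter_count)
for smiles, expected, got in mismatches:
    print(smiles, "expected", expected, "got", got)
```

Only the sulfoxide example from earlier in the thread passes; the other six
records are reported as mismatches.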

Cheers,

                                Andrew
                                [email protected]

