On Jun 29, 2018, at 02:43, 藤秀義 <hideyoshif...@gmail.com> wrote:
> Although not strictly based on the number of atoms, but on the length of
> SMILES string, the simplest way is using Python built-in functions as follows:
>
> smiles = 'CCC.CC'
> fragment = max(smiles.split('.'), key=len)
> print (fragment)

The mmpdb package I helped develop includes some functions which work
directly on the SMILES string. One of them uses a regular expression to
count the number of heavy atoms, assuming the SMILES is valid and is
written in 'normal' form, where the '.' is only used to distinguish
between disconnected atoms (that is, things like "C1.C1" are not supported).

>>> smiles = "CCC.BrCl.[238U]"
>>> fragment = max(smiles.split('.'), key=len)
>>> fragment
'[238U]'
>>> from mmpdblib.fragment_algorithm import get_num_heavies_from_smiles
>>> fragment = max(smiles.split('.'), key=get_num_heavies_from_smiles)
>>> fragment
'CCC'

I keep meaning to put together a package with the various SMILES tricks
that are possible without full chemistry perception.

On Jun 29, 2018, at 10:56, Ed Griffen <ed.grif...@medchemica.com> wrote:
> Using the string length to find the number of atoms in a molecule is OK - but
> you need to take account of the additional characters in SMILES that are not
> just atoms,
  ...
> Here’s a worked example:
>
> >>> SMILES = 'C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C'
> >>> print(len(SMILES))
> 34
> >>> heavies = [char for char in SMILES if char not in
> ...     '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''']
> >>> print(len(heavies))
> 13

That's neat! But it doesn't always give the correct count. It fails
for some two-letter elements:

>>> def count(smiles):
...     return sum(1 for c in smiles if c not in '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''')
...
>>> count("[Hg]")
0
>>> count("[Zn]")
2
>>> count("[Tc]")
2
>>> count("[As]")
2

It also gets aromatic boron wrong, as in:

>>> count("Cc1b[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1")
16
>>> count("Cc1B[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1")
17

and likewise aromatic tellurium.

These came up in a cross-comparison I did using ChEMBL as a test set.
I excluded records with [2H] and [3H] because RDKit considers those
to be heavy atoms while Ed's method does not.

The error rate is impressively low for such a simple approach, with only
467 mismatches out of 1,726,695 cases (about 0.027%).

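To make the regular-expression idea concrete, here is a minimal sketch of
counting heavy atoms by tokenizing the SMILES rather than filtering single
characters. It is not the mmpdb implementation: the pattern, the helper
name count_heavies_sketch, and the choice to treat every hydrogen isotope
(including [2H] and [3H]) as non-heavy are assumptions made for
illustration. It also assumes a valid SMILES and does not count the '*'
wildcard.

```python
import re

# One match per atom token. Order matters: "Cl" and "Br" must be tried
# before the one-letter symbols so they are not split into two atoms.
_ATOM = re.compile(r"""
    \[ (?P<bracket>[^\]]*) \]   # bracket atom, e.g. [238U], [n+], [Si]
  | Cl | Br                     # two-letter organic-subset atoms
  | [BCNOPSFI]                  # one-letter organic-subset atoms
  | [bcnops]                    # aromatic organic-subset atoms
""", re.VERBOSE)

# Inside a bracket: optional isotope, then 'H' NOT followed by a lowercase
# letter, so [H], [2H] and [H+] are hydrogen but [Hg], [He], [Ho] are not.
_HYDROGEN = re.compile(r"\d*H(?![a-z])")


def count_heavies_sketch(smiles):
    """Count non-hydrogen atoms in a valid SMILES string (illustrative only)."""
    n = 0
    for m in _ATOM.finditer(smiles):
        inner = m.group("bracket")  # None for atoms written without brackets
        if inner is not None and _HYDROGEN.match(inner):
            continue  # explicit hydrogen atom, not heavy in this sketch
        n += 1
    return n


smiles = "CCC.BrCl.[238U]"
print(max(smiles.split("."), key=count_heavies_sketch))  # CCC
print(count_heavies_sketch("[Hg]"))                      # 1
print(count_heavies_sketch("Cc1b[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1"))  # 17
```

Because each bracket atom is matched as a single token, the two-letter
elements and the aromatic boron case above each count once, at the cost
of a slightly longer pattern than the character filter.
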
Cheers,

				Andrew
				da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss