Hello Andrew,
Thank you for developing and documenting this awesome script!
I greatly appreciate you and the other helpful and generous folks on the
mailing list who have taken the time to assist me. Feel free to take the
rest of the day off. ;-)
Kind regards,
Gyro
On 2021-12-08 02:33 PM, Andrew Dalke wrote:
Hi Gyro,
On Dec 8, 2021, at 11:02, Gyro Funch <gyromagne...@gmail.com> wrote:
My work is in the area of toxicology and I am interested in generating SMILES
for molecules referred to as 'short chain chlorinated paraffins' (SCCP).
A general definition that is sometimes used is that an SCCP is given by the
molecular formula
C_{x} H_{2x-y+2} Cl_{y}
where
x = 10-13
y = 3-12
and the average chlorine content ranges from 40-70% by mass.
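As a rough check of the definition above, the chlorine mass fraction of each formula can be computed directly. This is a minimal sketch: the atomic weights are rounded standard values chosen for illustration, the function names are hypothetical, and the 40-70% range is applied per formula rather than to a mixture average.

```python
# Rounded standard atomic weights. These are an assumption for this
# sketch, not the values used by any script in this thread.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "Cl": 35.45}


def chlorine_mass_fraction(x, y):
    """Chlorine mass fraction of C_x H_(2x-y+2) Cl_y."""
    n_h = 2 * x - y + 2
    if n_h < 0:
        raise ValueError("more chlorines than available hydrogens")
    cl_mass = y * ATOMIC_WEIGHT["Cl"]
    total = x * ATOMIC_WEIGHT["C"] + n_h * ATOMIC_WEIGHT["H"] + cl_mass
    return cl_mass / total


def formulas_in_range(min_content=0.4, max_content=0.7):
    """(x, y) pairs with x = 10-13, y = 3-12 whose chlorine content is in range."""
    return [
        (x, y)
        for x in range(10, 14)
        for y in range(3, 13)
        if min_content <= chlorine_mass_fraction(x, y) <= max_content
    ]


if __name__ == "__main__":
    for x, y in formulas_in_range():
        frac = chlorine_mass_fraction(x, y)
        print(f"C{x}H{2 * x - y + 2}Cl{y}: {frac:.3f}")
```

Formulas close to the 0.4 or 0.7 boundary can move in or out of the list when slightly different atomic weights are used.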
-----
Can anyone provide guidance on how to generate the list of SMILES corresponding
to the above rules?
Here's an alternate approach to the ones presented so far.
https://gist.github.com/adalke/e62df8774032560fef750fa9c88b6516
Like Wim's version, it also generates the SMILES at the syntax level, though by
default it uses RDKit to generate canonical SMILES output. (Use --no-canonical
to disable the canonicalization step, which also makes it faster.)
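To illustrate what generating SMILES at the syntax level can look like, here is a minimal sketch that builds the strings directly, with no toolkit involved. It is not the code from the gist: the function names are hypothetical, it assumes unbranched carbon chains only, it ignores stereochemistry, and it applies no chlorine-content filter.

```python
from itertools import product


def chain_smiles(cl_counts):
    """Write an unbranched carbon chain as SMILES.

    cl_counts[i] is the number of chlorines on carbon i.
    """
    parts = []
    last = len(cl_counts) - 1
    for i, n in enumerate(cl_counts):
        if i == last and n > 0:
            # On the final carbon the last chlorine needs no parentheses.
            parts.append("C" + "(Cl)" * (n - 1) + "Cl")
        else:
            parts.append("C" + "(Cl)" * n)
    return "".join(parts)


def enumerate_chains(n_carbons, n_chlorines, max_cl_per_atom=3):
    """List one SMILES per distinct chlorine placement on an unbranched chain."""
    # A carbon can carry at most 4 minus its number of carbon neighbors.
    capacities = []
    for i in range(n_carbons):
        neighbors = (i > 0) + (i < n_carbons - 1)
        capacities.append(min(max_cl_per_atom, 4 - neighbors))

    seen = set()
    results = []
    for counts in product(*(range(cap + 1) for cap in capacities)):
        if sum(counts) != n_chlorines:
            continue
        # A chain read backwards is the same molecule, so keep one direction.
        key = min(counts, counts[::-1])
        if key in seen:
            continue
        seen.add(key)
        results.append(chain_smiles(key))
    return results


if __name__ == "__main__":
    for smiles in enumerate_chains(4, 3):
        print(smiles)
```

For four carbons and three chlorines this sketch yields nine strings, the same count as the example output in this message, though the strings are not in RDKit's canonical form. The brute-force loop over all placements is fine for short chains but is not tuned for speed.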
Here it is with 4 carbons and 3 chlorines.
% python sccp_smiles.py --C 4 --Cl 3
Content range not specified. Using --min-content 0.4 and --max-content 0.7.
CCCC(Cl)(Cl)Cl
CCC(Cl)C(Cl)Cl
CCC(Cl)(Cl)CCl
CC(Cl)CC(Cl)Cl
CC(Cl)C(Cl)CCl
CC(Cl)C(C)(Cl)Cl
CC(Cl)(Cl)CCCl
ClCCCC(Cl)Cl
ClCCC(Cl)CCl
The "--C" and "--Cl" options are aliases for "--min-C" and "--min-Cl"; if a
maximum is not specified, it is set to the corresponding minimum.
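The alias and default behavior described above could be set up with argparse along these lines. This is only a sketch of one way to do it, not the script's actual code: the parse_args helper, the dest names, and making the minimums required are all assumptions.

```python
import argparse


def parse_args(argv):
    # allow_abbrev=False keeps argparse from guessing at partial option names.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    # Two option strings on one argument make "--C" an alias for "--min-C".
    parser.add_argument("--min-C", "--C", dest="min_C", type=int, required=True)
    parser.add_argument("--max-C", dest="max_C", type=int, default=None)
    parser.add_argument("--min-Cl", "--Cl", dest="min_Cl", type=int, required=True)
    parser.add_argument("--max-Cl", dest="max_Cl", type=int, default=None)
    args = parser.parse_args(argv)

    # An unspecified maximum falls back to the corresponding minimum.
    if args.max_C is None:
        args.max_C = args.min_C
    if args.max_Cl is None:
        args.max_Cl = args.min_Cl
    return args


if __name__ == "__main__":
    args = parse_args(["--C", "4", "--Cl", "3"])
    print(args.min_C, args.max_C, args.min_Cl, args.max_Cl)
```

With this setup "--C 4 --Cl 3" means exactly 4 carbons and 3 chlorines, while adding "--max-C" or "--max-Cl" widens the corresponding range.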
Here's a range using all the bells and whistles:
% time python sccp_smiles.py --min-C 10 --max-C 13 --min-Cl 3 --max-Cl 12 --max-Cl-per-atom 2 --min-content 0.4 --max-content 0.7 --no-canonicalize > example.smi
2.030u 0.156s 0:02.44 89.3% 0+0k 0+0io 0pf+0w
% wc -l example.smi
440334 example.smi
Wim reported 437001 for the same configuration. I haven't figured out if the
difference is due simply to differences in the molecular weight values.
I couldn't canonicalize and pin down the differences in part because Wim's
output generates SMILES strings that RDKit cannot parse:
% grep '^[(]' CSSP.smi | head -4
(Cl)C(Cl)(Cl)CCCCCCCCC(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCCC(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCCC(Cl)(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)
>>> from rdkit import Chem
>>> Chem.CanonSmiles("(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)")
[14:31:12] SMILES Parse Error: syntax error while parsing:
(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)
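One possible workaround, assuming the leading parenthesized atom is what the parser rejects, is to unwrap that first branch before parsing. The helper below is a hypothetical sketch in plain Python. It only handles a single unbracketed atom such as "(Cl)" at the very start of the string, where "(Cl)C..." and "ClC..." describe the same bond. Whether RDKit then accepts the rest of the string is not checked here.

```python
import re

# A leading branch holding exactly one organic-subset atom, e.g. "(Cl)".
# Multi-atom leading branches are left alone because unwrapping them
# would change which atom is bonded to the rest of the chain.
_LEADING_ATOM_BRANCH = re.compile(r"^\((Cl|Br|[BCNOPSFI])\)")


def unwrap_leading_branch(smiles):
    """Rewrite a leading "(X)" as "X"; return other strings unchanged."""
    return _LEADING_ATOM_BRANCH.sub(r"\1", smiles, count=1)


if __name__ == "__main__":
    print(unwrap_leading_branch("(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)"))
```

The rewritten strings could then be passed to Chem.CanonSmiles so that the two outputs can be compared as sets of canonical SMILES.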
My code isn't well tested, but perhaps enough to get you on the way.
Cheers,
Andrew
da...@dalkescientific.com
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss