Hi Gyro,

> On Dec 8, 2021, at 11:02, Gyro Funch <gyromagne...@gmail.com> wrote:
>
> My work is in the area of toxicology and I am interested in generating SMILES
> for molecules referred to as 'short chain chlorinated paraffins' (SCCP).
>
> A general definition that is sometimes used is that an SCCP is given by the
> molecular formula
>
> C_{x} H_{2x-y+2} Cl_{y}
>
> where
>
> x = 10-13
> y = 3-12
>
> and the average chlorine content ranges from 40-70% by mass.
>
> -----
>
> Can anyone provide guidance on how to generate the list of SMILES
> corresponding to the above rules?
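As an aside on the content rule quoted above, here is a minimal sketch of how the chlorine mass fraction of C_{x} H_{2x-y+2} Cl_{y} could be computed and used to filter (x, y) pairs. The names chlorine_mass_fraction and allowed_formulas are made up for illustration and are not taken from sccp_smiles.py. The atomic weights are the usual abridged standard values, and applying the 40-70% range to each formula separately, rather than to the average of a mixture, is an assumption.

```python
# Abridged standard atomic weights (g/mol).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "Cl": 35.45}


def chlorine_mass_fraction(x, y):
    """Mass fraction of Cl in C_x H_(2x-y+2) Cl_y (hypothetical helper)."""
    num_h = 2 * x - y + 2
    if num_h < 0:
        raise ValueError("more chlorines than substitutable hydrogens")
    cl_mass = y * ATOMIC_WEIGHT["Cl"]
    total = x * ATOMIC_WEIGHT["C"] + num_h * ATOMIC_WEIGHT["H"] + cl_mass
    return cl_mass / total


def allowed_formulas(min_c=10, max_c=13, min_cl=3, max_cl=12,
                     min_content=0.4, max_content=0.7):
    """Yield (x, y, fraction) for formulas inside the content range.

    Assumption: the range is checked per formula, not per mixture.
    """
    for x in range(min_c, max_c + 1):
        for y in range(min_cl, max_cl + 1):
            if y > 2 * x + 2:
                continue  # not enough hydrogens to replace
            fraction = chlorine_mass_fraction(x, y)
            if min_content <= fraction <= max_content:
                yield x, y, fraction


if __name__ == "__main__":
    for x, y, fraction in allowed_formulas():
        print(f"C{x}H{2 * x - y + 2}Cl{y}  {fraction:.3f}")
```

With these weights, 13 carbons with 3 chlorines falls under 40% and is filtered out, while 10 carbons with 3 chlorines is kept.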
Here's an alternate approach to the ones presented so far.

  https://gist.github.com/adalke/e62df8774032560fef750fa9c88b6516

Like Wim's version, it also generates the SMILES at the syntax level, though by default it uses RDKit to generate canonical SMILES output. (Use --no-canonicalize to disable the canonicalization step, which is also faster.)

Here it is with 4 carbons and 3 chlorines.

  % python sccp_smiles.py --C 4 --Cl 3
  Content range not specified. Using --min-content 0.4 and --max-content 0.7.
  CCCC(Cl)(Cl)Cl
  CCC(Cl)C(Cl)Cl
  CCC(Cl)(Cl)CCl
  CC(Cl)CC(Cl)Cl
  CC(Cl)C(Cl)CCl
  CC(Cl)C(C)(Cl)Cl
  CC(Cl)(Cl)CCCl
  ClCCCC(Cl)Cl
  ClCCC(Cl)CCl

The "--C" and "--Cl" options are aliases for "--min-C" and "--min-Cl"; if a maximum is not specified then it is set to the corresponding minimum.

Here's a range using all the bells and whistles:

  % time python sccp_smiles.py --min-C 10 --max-C 13 --min-Cl 3 --max-Cl 12 --max-Cl-per-atom 2 --min-content 0.4 --max-content 0.7 --no-canonicalize > example.smi
  2.030u 0.156s 0:02.44 89.3% 0+0k 0+0io 0pf+0w
  % wc -l example.smi
    440334 example.smi

Wim reported 437001 for the same configuration. I haven't figured out if the difference is due simply to differences in the molecular weight values. I couldn't canonicalize the two outputs and pin down the differences, in part because Wim's output contains SMILES strings that RDKit cannot parse. They start with a branch, and in SMILES a branch must follow an atom:

  % grep '^[(]' CSSP.smi | head -4
  (Cl)C(Cl)(Cl)CCCCCCCCC(Cl)(Cl)
  (Cl)C(Cl)(Cl)CCCCCCCC(Cl)C(Cl)(Cl)
  (Cl)C(Cl)(Cl)CCCCCCCC(Cl)(Cl)C(Cl)(Cl)
  (Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)

  >>> from rdkit import Chem
  >>> Chem.CanonSmiles("(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)")
  [14:31:12] SMILES Parse Error: syntax error while parsing: (Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)

My code isn't well tested, but it is perhaps enough to get you on the way.

Cheers,

Andrew
da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss