Hello Andrew,

Thank you for developing and documenting this awesome script!

I greatly appreciate you and the other helpful and generous folks on the mailing list who have taken the time to assist me. Feel free to take the rest of the day off.  ;-)

Kind regards,
Gyro


On 2021-12-08 02:33 PM, Andrew Dalke wrote:
Hi Gyro,

On Dec 8, 2021, at 11:02, Gyro Funch <gyromagne...@gmail.com> wrote:

My work is in the area of toxicology and I am interested in generating SMILES 
for molecules referred to as 'short chain chlorinated paraffins' (SCCP).

A general definition that is sometimes used is that an SCCP is given by the 
molecular formula

C_{x} H_{2x-y+2} Cl_{y}

where

x = 10-13
y = 3-12

and the average chlorine content ranges from 40-70% by mass.

-----

Can anyone provide guidance on how to generate the list of SMILES corresponding 
to the above rules?

Here's an alternate approach to the ones presented so far.

   https://gist.github.com/adalke/e62df8774032560fef750fa9c88b6516

Like Wim's version, it also generates the SMILES at the syntax level, though by 
default it uses RDKit to generate canonical SMILES output. (Use --no-canonical 
to skip the canonicalization step, which also makes it faster.)

Here it is with 4 carbons and 3 chlorines.

% python sccp_smiles.py --C 4 --Cl 3
Content range not specified. Using --min-content 0.4 and --max-content 0.7.
CCCC(Cl)(Cl)Cl
CCC(Cl)C(Cl)Cl
CCC(Cl)(Cl)CCl
CC(Cl)CC(Cl)Cl
CC(Cl)C(Cl)CCl
CC(Cl)C(C)(Cl)Cl
CC(Cl)(Cl)CCCl
ClCCCC(Cl)Cl
ClCCC(Cl)CCl
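
For readers who want to see what "at the syntax level" can look like, here is a minimal sketch. It is not the code from the gist. It assumes an unbranched carbon chain, caps the chlorines per carbon by the hydrogens available (3 on a terminal carbon, 2 on an internal one), and drops mirror-image duplicates by comparing each distribution with its reverse. The function name and the `max_cl_per_atom` parameter are made up for this illustration. It is brute force, so it is only practical for short chains.

```python
from itertools import product


def linear_chloroalkane_smiles(n_c, n_cl, max_cl_per_atom=3):
    """Yield one (non-canonical) SMILES per distinct way of placing n_cl
    chlorines on an unbranched chain of n_c carbons (hypothetical helper)."""
    if n_c < 2:
        raise ValueError("this sketch assumes at least two carbons")

    # Terminal carbons carry 3 replaceable hydrogens, internal carbons 2.
    caps = [
        min(max_cl_per_atom, 3 if i in (0, n_c - 1) else 2)
        for i in range(n_c)
    ]

    for counts in product(*(range(cap + 1) for cap in caps)):
        if sum(counts) != n_cl:
            continue
        # A chain read backwards is the same molecule; keep one orientation.
        if counts > counts[::-1]:
            continue

        parts = []
        for i, k in enumerate(counts):
            if i == n_c - 1 and k > 0:
                # Last atom: leave the final Cl unparenthesized.
                parts.append("C" + "(Cl)" * (k - 1) + "Cl")
            else:
                # A branch always follows its atom, so no string starts with "(".
                parts.append("C" + "(Cl)" * k)
        yield "".join(parts)


if __name__ == "__main__":
    for smiles in linear_chloroalkane_smiles(4, 3):
        print(smiles)
```

For 4 carbons and 3 chlorines this yields nine strings, the same count as the listing above, although the strings themselves are not canonical and so are written differently.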

The "--C" and "--Cl" are aliases for "--min-C" and "--min-Cl"; if the maximums 
are not specified then the maximum is set to the minimum.
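
The alias and default behaviour described above could be set up with argparse roughly as follows. This is only a sketch of one way to do it, not the gist's actual option handling; the `dest` names and the `parse_counts` helper are assumptions made for the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="hypothetical sketch of the count options")
    # "--C" and "--Cl" are registered as second spellings of the minimums.
    parser.add_argument("--min-C", "--C", dest="min_C", type=int, required=True)
    parser.add_argument("--max-C", dest="max_C", type=int, default=None)
    parser.add_argument("--min-Cl", "--Cl", dest="min_Cl", type=int, required=True)
    parser.add_argument("--max-Cl", dest="max_Cl", type=int, default=None)
    return parser


def parse_counts(argv):
    args = build_parser().parse_args(argv)
    # If a maximum was not given, fall back to the corresponding minimum.
    if args.max_C is None:
        args.max_C = args.min_C
    if args.max_Cl is None:
        args.max_Cl = args.min_Cl
    return args


if __name__ == "__main__":
    print(parse_counts(["--C", "4", "--Cl", "3"]))
```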

Here's a range using all the bells and whistles:

% time python sccp_smiles.py --min-C 10 --max-C 13 --min-Cl 3 --max-Cl 12 \
    --max-Cl-per-atom 2 --min-content 0.4 --max-content 0.7 --no-canonicalize \
    > example.smi
2.030u 0.156s 0:02.44 89.3%     0+0k 0+0io 0pf+0w
% wc -l example.smi
   440334 example.smi

Wim reported 437001 for the same configuration. I haven't figured out if the 
difference is due simply to differences in the molecular weight values.
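
One way to check the molecular weight explanation: the chlorine content depends only on the formula C_{x} H_{2x-y+2} Cl_{y}, so the filter either accepts or rejects every isomer of a given (x, y) together. The sketch below computes the chlorine mass fraction per formula. The atomic weights in the table are common rounded values and are an assumption here; neither script's actual values are known from this thread, and the function names are made up for the example.

```python
# Rounded standard atomic weights (an assumption; either script may use
# slightly different values).
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "Cl": 35.45}


def chlorine_mass_fraction(n_c, n_cl, weights=ATOMIC_WEIGHTS):
    """Mass fraction of chlorine in C_x H_(2x-y+2) Cl_y."""
    n_h = 2 * n_c - n_cl + 2
    if n_h < 0:
        raise ValueError("more chlorines than available hydrogens")
    cl_mass = n_cl * weights["Cl"]
    total = n_c * weights["C"] + n_h * weights["H"] + cl_mass
    return cl_mass / total


def formulas_in_range(min_c=10, max_c=13, min_cl=3, max_cl=12,
                      lo=0.4, hi=0.7, weights=ATOMIC_WEIGHTS):
    """List the (x, y) pairs whose chlorine content lies within [lo, hi]."""
    return [
        (x, y)
        for x in range(min_c, max_c + 1)
        for y in range(min_cl, max_cl + 1)
        if lo <= chlorine_mass_fraction(x, y, weights) <= hi
    ]


if __name__ == "__main__":
    for x, y in formulas_in_range():
        print(x, y, round(chlorine_mass_fraction(x, y), 4))
```

Calling formulas_in_range() with a second weights table and comparing the two lists would show whether any formula sits close enough to the 0.4 or 0.7 limit to be accepted by one set of weights and rejected by the other.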

I couldn't canonicalize and pin down the differences in part because Wim's 
output generates SMILES strings that RDKit cannot parse:

% grep '^[(]' CSSP.smi | head -4
(Cl)C(Cl)(Cl)CCCCCCCCC(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCCC(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCCC(Cl)(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)

from rdkit import Chem
Chem.CanonSmiles("(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)")
[14:31:12] SMILES Parse Error: syntax error while parsing: (Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)

My code isn't well tested, but perhaps enough to get you on the way.

Cheers,


                                Andrew
                                da...@dalkescientific.com





_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss