Hi Gyro,

> On Dec 8, 2021, at 11:02, Gyro Funch <gyromagne...@gmail.com> wrote:
> 
> My work is in the area of toxicology and I am interested in generating SMILES 
> for molecules referred to as 'short chain chlorinated paraffins' (SCCP).
> 
> A general definition that is sometimes used is that an SCCP is given by the 
> molecular formula
> 
> C_{x} H_{2x-y+2} Cl_{y}
> 
> where
> 
> x = 10-13
> y = 3-12
> 
> and the average chlorine content ranges from 40-70% by mass.
> 
> -----
> 
> Can anyone provide guidance on how to generate the list of SMILES 
> corresponding to the above rules?

Here's an alternate approach to the ones presented so far.

  https://gist.github.com/adalke/e62df8774032560fef750fa9c88b6516

Like Wim's version, it also generates the SMILES at the syntax level, though by 
default it uses RDKit to generate canonical SMILES output. (Use --no-canonical 
to skip the canonicalization step, which also makes it faster.)

Here it is with 4 carbons and 3 chlorines. 

% python sccp_smiles.py --C 4 --Cl 3
Content range not specified. Using --min-content 0.4 and --max-content 0.7.
CCCC(Cl)(Cl)Cl
CCC(Cl)C(Cl)Cl
CCC(Cl)(Cl)CCl
CC(Cl)CC(Cl)Cl
CC(Cl)C(Cl)CCl
CC(Cl)C(C)(Cl)Cl
CC(Cl)(Cl)CCCl
ClCCCC(Cl)Cl
ClCCC(Cl)CCl
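
To illustrate what enumeration at the syntax level can look like, here is a 
minimal sketch. It is not the code from the gist. It assumes unbranched chains 
only, the function name linear_chloroalkanes is made up for illustration, and 
the search is brute force. It distributes the chlorines over the chain 
positions, caps each carbon by its available hydrogens, and keeps only one 
reading direction of each chain.

```python
from itertools import product

def linear_chloroalkanes(num_c, num_cl, max_cl_per_atom=3):
    """Yield one SMILES per distinct chlorine placement on an unbranched chain.

    Illustrative only: assumes a linear carbon skeleton.
    """
    # Hydrogens available for substitution on each carbon of the parent alkane.
    if num_c == 1:
        capacity = [4]
    else:
        capacity = [3] + [2] * (num_c - 2) + [3]
    limits = [min(cap, max_cl_per_atom) for cap in capacity]

    for counts in product(*(range(limit + 1) for limit in limits)):
        if sum(counts) != num_cl:
            continue
        # A chain read backwards is the same molecule; keep one orientation.
        if counts > counts[::-1]:
            continue
        # Each chlorine becomes a "(Cl)" branch on its carbon.
        yield "".join("C" + "(Cl)" * n for n in counts)

if __name__ == "__main__":
    for smiles in linear_chloroalkanes(4, 3):
        print(smiles)
```

For 4 carbons and 3 chlorines this yields nine strings, the same count as the 
list above, although the strings themselves are not canonical SMILES.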

The "--C" and "--Cl" options are aliases for "--min-C" and "--min-Cl"; if a 
maximum is not specified, it is set to the corresponding minimum.

Here's a range using all the bells and whistles:

% time python sccp_smiles.py --min-C 10 --max-C 13 --min-Cl 3 --max-Cl 12 --max-Cl-per-atom 2 --min-content 0.4 --max-content 0.7 --no-canonicalize > example.smi
2.030u 0.156s 0:02.44 89.3%     0+0k 0+0io 0pf+0w
% wc -l example.smi
  440334 example.smi

Wim reported 437001 for the same configuration. I haven't figured out if the 
difference is due simply to differences in the molecular weight values.
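
To make the role of the molecular weight values concrete, here is a small 
sketch of the chlorine content calculation for the formula 
C_{x} H_{2x-y+2} Cl_{y}. The atomic weights below are rounded standard values 
chosen for illustration; they are an assumption, not necessarily the values 
used by either script, and the function names are hypothetical.

```python
# Rounded standard atomic weights (assumed here; other tools may differ slightly).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "Cl": 35.45}

def chlorine_mass_fraction(num_c, num_cl, weights=ATOMIC_WEIGHT):
    """Chlorine mass fraction for C_x H_(2x-y+2) Cl_y."""
    num_h = 2 * num_c - num_cl + 2
    if num_h < 0:
        raise ValueError("more chlorines than substitutable hydrogens")
    cl_mass = num_cl * weights["Cl"]
    total = num_c * weights["C"] + num_h * weights["H"] + cl_mass
    return cl_mass / total

def formulas_in_range(min_content=0.4, max_content=0.7):
    """List (x, y, fraction) for x = 10-13, y = 3-12 inside the content window."""
    return [
        (x, y, chlorine_mass_fraction(x, y))
        for x in range(10, 14)
        for y in range(3, 13)
        if min_content <= chlorine_mass_fraction(x, y) <= max_content
    ]

if __name__ == "__main__":
    for x, y, fraction in formulas_in_range():
        print(f"C{x} Cl{y}: {fraction:.4f}")
```

Any formula whose fraction lands very close to 0.4 or 0.7 is a candidate for 
being kept by one weight table and dropped by another.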

I couldn't canonicalize and pin down the differences, in part because Wim's 
output contains SMILES strings that RDKit cannot parse, such as those starting 
with an open parenthesis:

% grep '^[(]' CSSP.smi | head -4
(Cl)C(Cl)(Cl)CCCCCCCCC(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCCC(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCCC(Cl)(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)

>>> from rdkit import Chem
>>> Chem.CanonSmiles("(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)")
[14:31:12] SMILES Parse Error: syntax error while parsing: (Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)
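
One possible workaround, sketched under the assumption that a leading "(Cl)" 
is meant as a chlorine bonded to the first carbon: move that atom out of its 
parentheses before handing the string to RDKit. The helper name 
fix_leading_branch is hypothetical, it handles only a single unnested leading 
branch, and whether every rewritten string then parses has not been checked.

```python
import re

# A branch at the very start of the string, with no nested parentheses.
_LEADING_BRANCH = re.compile(r"^\(([^()]+)\)")

def fix_leading_branch(smiles):
    """Rewrite '(X)rest' as 'Xrest'; return other strings unchanged."""
    return _LEADING_BRANCH.sub(r"\1", smiles, count=1)

if __name__ == "__main__":
    print(fix_leading_branch("(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)"))
```

The rewritten string could then be passed to Chem.CanonSmiles() so that both 
outputs can be compared as sets of canonical SMILES.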

My code isn't well tested, but it is perhaps enough to get you on your way.

Cheers,


                                Andrew
                                da...@dalkescientific.com




_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss