Hi all,
I just noticed why some of the SMILES from my script don't parse: I
accidentally put brackets at the terminal carbons, my bad. Here is a fixed
script, which also has updated molecular weights:
https://gist.github.com/dehaenw/bb5704fc4d108eec8f8e999d6ab79118
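
For anyone following along, the chlorine content of a given
C_{x} H_{2x-y+2} Cl_{y} formula can be checked in a few lines of Python.
This is only a sketch: the atomic weights are rounded standard values
assumed here (not necessarily the ones used in either gist), and the
function names are made up for illustration.

```python
# Rounded standard atomic weights (assumed values; the gists may use others).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "Cl": 35.45}


def chlorine_mass_fraction(n_carbon, n_chlorine):
    """Mass fraction of Cl for the formula C_x H_(2x-y+2) Cl_y."""
    n_hydrogen = 2 * n_carbon - n_chlorine + 2
    if n_hydrogen < 0:
        raise ValueError("more chlorines than replaceable hydrogens")
    cl_mass = n_chlorine * ATOMIC_WEIGHT["Cl"]
    total_mass = (
        n_carbon * ATOMIC_WEIGHT["C"]
        + n_hydrogen * ATOMIC_WEIGHT["H"]
        + cl_mass
    )
    return cl_mass / total_mass


def in_content_range(n_carbon, n_chlorine, min_content=0.4, max_content=0.7):
    """True if the chlorine mass fraction lies within the given window."""
    fraction = chlorine_mass_fraction(n_carbon, n_chlorine)
    return min_content <= fraction <= max_content


if __name__ == "__main__":
    for x, y in [(10, 3), (13, 3), (10, 12)]:
        print(x, y, round(chlorine_mass_fraction(x, y), 3), in_content_range(x, y))
```

With these weights C10H19Cl3 comes out at about 43% Cl, inside the 40-70%
window, while C13H25Cl3 (about 37%) and C10H10Cl12 (about 77%) fall outside.
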
I looked into the difference in the total number of SMILES using Andrew's
nice and general implementation. From the missing compounds it was
immediately clear that there was an error in the way my script dealt with
capping of the atoms: when both the first and the last atom bear two Cl
atoms, one of the two options was inadvertently not added. The script now
outputs 442849 parsable SMILES. After canonicalization the number is
reduced to 440334, which is consistent with Andrew's result.
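
One source of duplicate SMILES in this kind of syntax-level enumeration is
that an unbranched chain can be written starting from either end. The
sketch below illustrates this without RDKit. It is not the code from either
gist: it assumes unbranched chains, ignores stereochemistry, and describes
each molecule as a tuple of chlorine counts per carbon. The function names,
and the caps of three chlorines on a terminal carbon and two on an inner
carbon, are assumptions made for the example.

```python
from itertools import product


def chlorine_patterns(n_carbon, n_chlorine, max_per_atom=3):
    """Return unique chlorine-count tuples for an unbranched carbon chain."""
    if n_carbon < 2:
        raise ValueError("this sketch only handles chains of two or more carbons")
    # A terminal CH3 has three replaceable hydrogens, an inner CH2 has two.
    caps = [
        min(max_per_atom, 3 if i in (0, n_carbon - 1) else 2)
        for i in range(n_carbon)
    ]
    unique = set()
    for counts in product(*(range(cap + 1) for cap in caps)):
        if sum(counts) == n_chlorine:
            # Reading the chain from the other end reverses the tuple,
            # so keep only the smaller of the two orderings.
            unique.add(min(counts, counts[::-1]))
    return sorted(unique)


def to_smiles(counts):
    """Write a pattern as SMILES without a leading or trailing branch."""
    parts = []
    last_index = len(counts) - 1
    for i, n in enumerate(counts):
        if i == last_index and n > 0:
            # On the final carbon the last chlorine simply continues the chain.
            parts.append("C" + "(Cl)" * (n - 1) + "Cl")
        else:
            parts.append("C" + "(Cl)" * n)
    return "".join(parts)


if __name__ == "__main__":
    for pattern in chlorine_patterns(4, 3):
        print(to_smiles(pattern))
```

For four carbons and three chlorines this gives nine patterns, the same
count as the nine SMILES in Andrew's example output quoted below. The
strings are valid but not canonical, so they will not all match his output
character for character.
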
best wishes
wim

On Wed, Dec 8, 2021 at 2:59 PM Gyro Funch <gyromagne...@gmail.com> wrote:

> Hello Andrew,
>
> Thank you for developing and documenting this awesome script!
>
> I greatly appreciate you and the other helpful and generous folks on the
> mailing list who have taken the time to assist me. Feel free to take the
> rest of the day off.  ;-)
>
> Kind regards,
> Gyro
>
>
> On 2021-12-08 02:33 PM, Andrew Dalke wrote:
> > Hi Gyro,
> >
> >> On Dec 8, 2021, at 11:02, Gyro Funch <gyromagne...@gmail.com> wrote:
> >>
> >> My work is in the area of toxicology and I am interested in generating
> >> SMILES for molecules referred to as 'short chain chlorinated paraffins'
> >> (SCCP).
> >>
> >> A general definition that is sometimes used is that an SCCP is given by
> >> the molecular formula
> >>
> >> C_{x} H_{2x-y+2} Cl_{y}
> >>
> >> where
> >>
> >> x = 10-13
> >> y = 3-12
> >>
> >> and the average chlorine content ranges from 40-70% by mass.
> >>
> >> -----
> >>
> >> Can anyone provide guidance on how to generate the list of SMILES
> >> corresponding to the above rules?
> > Here's an alternate approach to the ones presented so far.
> >
> >    https://gist.github.com/adalke/e62df8774032560fef750fa9c88b6516
> >
> > Like Wim's version, it also generates the SMILES at the syntax level,
> > though by default it uses RDKit to generate canonical SMILES output. (Use
> > --no-canonical to disable the canonicalization step, which is also faster.)
> >
> > Here it is with 4 carbons and 3 chlorines.
> >
> > % python sccp_smiles.py --C 4 --Cl 3
> > Content range not specified. Using --min-content 0.4 and --max-content 0.7.
> > CCCC(Cl)(Cl)Cl
> > CCC(Cl)C(Cl)Cl
> > CCC(Cl)(Cl)CCl
> > CC(Cl)CC(Cl)Cl
> > CC(Cl)C(Cl)CCl
> > CC(Cl)C(C)(Cl)Cl
> > CC(Cl)(Cl)CCCl
> > ClCCCC(Cl)Cl
> > ClCCC(Cl)CCl
> >
> > The "--C" and "--Cl" are aliases for "--min-C" and "--min-Cl"; if the
> > maximums are not specified then the maximum is set to the minimum.
> >
> > Here's a range using all the bells and whistles:
> >
> > % time python sccp_smiles.py --min-C 10 --max-C 13 --min-Cl 3 --max-Cl 12 --max-Cl-per-atom 2 --min-content 0.4 --max-content 0.7 --no-canonicalize > example.smi
> > 2.030u 0.156s 0:02.44 89.3%   0+0k 0+0io 0pf+0w
> > % wc -l example.smi
> >    440334 example.smi
> >
> > Wim reported 437001 for the same configuration. I haven't figured out if
> > the difference is due simply to differences in the molecular weight values.
> >
> > I couldn't canonicalize and pin down the differences in part because
> > Wim's output generates SMILES strings that RDKit cannot parse:
> >
> > % grep '^[(]' CSSP.smi | head -4
> > (Cl)C(Cl)(Cl)CCCCCCCCC(Cl)(Cl)
> > (Cl)C(Cl)(Cl)CCCCCCCC(Cl)C(Cl)(Cl)
> > (Cl)C(Cl)(Cl)CCCCCCCC(Cl)(Cl)C(Cl)(Cl)
> > (Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)
> >
> > >>> from rdkit import Chem
> > >>> Chem.CanonSmiles("(Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)")
> > [14:31:12] SMILES Parse Error: syntax error while parsing: (Cl)C(Cl)(Cl)CCCCCCC(Cl)CC(Cl)(Cl)
> >
> > My code isn't well tested, but perhaps enough to get you on the way.
> >
> > Cheers,
> >
> >
> >                               Andrew
> >                               da...@dalkescientific.com
> >
> >
>
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
