Hello, I am trying to process V3000 MolBlocks from some SD files, and I seem to encounter issues when enhanced stereochemistry information is present, depending on the source of the SD file.

To test that the molecule to SDF and back conversion within RDKit was working OK, I ran this code:

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools

# 1. convert to molecule a CXSMILES with encoded enhanced stereochemistry
m = Chem.MolFromSmiles('O=C(NC[C@@H]1CC[C@H](C2=CC=CC=C2)O1)N[C@@H]1COC[C@@H]1O |&1:4,7,&2:16,20|')
# check that the V3K molblock contains the enhanced stereochemistry information
print(Chem.MolToV3KMolBlock(m))

# 2. write the molecule to an SDF
writer = Chem.SDWriter('m_with_enh_stereo.sdf')
writer.SetForceV3000(True)
writer.write(m)
writer.close()

# 3. read the molecule back into a list ms
with Chem.SDMolSupplier('m_with_enh_stereo.sdf') as SDF:
    ms = [m for m in SDF if m is not None]
# check that the V3000 molblock is OK
print(Chem.MolToV3KMolBlock(ms[0]))

This worked well. The content of the SD file made by this script ('m_with_enh_stereo.sdf') was:


     RDKit          2D

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 22 24 0 0 0
M  V30 BEGIN ATOM
M  V30 1 O 7.414605 -6.052405 0.000000 0
M  V30 2 C 6.201079 -6.934083 0.000000 0
M  V30 3 N 4.830761 -6.323978 0.000000 0
M  V30 4 C 4.673969 -4.832195 0.000000 0
M  V30 5 C 3.303650 -4.222090 0.000000 0
M  V30 6 C 2.004612 -4.972090 0.000000 0
M  V30 7 C 0.889895 -3.968394 0.000000 0
M  V30 8 C 1.500000 -2.598076 0.000000 0
M  V30 9 C 0.750000 -1.299038 0.000000 0
M  V30 10 C 1.500000 0.000000 0.000000 0
M  V30 11 C 0.750000 1.299038 0.000000 0
M  V30 12 C -0.750000 1.299038 0.000000 0
M  V30 13 C -1.500000 0.000000 0.000000 0
M  V30 14 C -0.750000 -1.299038 0.000000 0
M  V30 15 O 2.991783 -2.754869 0.000000 0
M  V30 16 N 6.357872 -8.425866 0.000000 0
M  V30 17 C 7.728190 -9.035971 0.000000 0
M  V30 18 C 9.027228 -8.285971 0.000000 0
M  V30 19 O 10.141946 -9.289667 0.000000 0
M  V30 20 C 9.531841 -10.659985 0.000000 0
M  V30 21 C 8.040058 -10.503192 0.000000 0
M  V30 22 O 7.036362 -11.617910 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 5 4 CFG=3
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 8 7 CFG=3
M  V30 8 1 8 9
M  V30 9 2 9 10
M  V30 10 1 10 11
M  V30 11 2 11 12
M  V30 12 1 12 13
M  V30 13 2 13 14
M  V30 14 1 8 15
M  V30 15 1 2 16
M  V30 16 1 17 16 CFG=3
M  V30 17 1 17 18
M  V30 18 1 18 19
M  V30 19 1 19 20
M  V30 20 1 20 21
M  V30 21 1 21 22 CFG=3
M  V30 22 1 15 5
M  V30 23 1 21 17
M  V30 24 1 14 9
M  V30 END BOND
M  V30 BEGIN COLLECTION
M  V30 MDLV30/STERAC1 ATOMS=(2 5 8)
M  V30 MDLV30/STERAC2 ATOMS=(2 17 21)
M  V30 END COLLECTION
M  V30 END CTAB
M  END
>  <_CXSMILES_Data>  (1)
|&1:4,7,&2:16,20|

$$$$

Then I tried reading an SD file for the exact same molecule, made by some other software. The content of that SD file ('mol_with_enhanced_stereo_2_And_groups.sdf') was:

2 And groups, from CXSMILES
  SciTegic04042214202D

  0  0  0  0  0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 22 24 0 0 0
M  V30 BEGIN ATOM
M  V30 1 O 7.4146 -6.05241 0 0
M  V30 2 C 6.20108 -6.93408 0 0
M  V30 3 N 4.83076 -6.32398 0 0
M  V30 4 C 4.67397 -4.83219 0 0
M  V30 5 C 3.30365 -4.22209 0 0 CFG=2
M  V30 6 C 2.00461 -4.97209 0 0
M  V30 7 C 0.8899 -3.96839 0 0
M  V30 8 C 1.5 -2.59808 0 0 CFG=2
M  V30 9 C 0.75 -1.29904 0 0
M  V30 10 C 1.5 0 0 0
M  V30 11 C 0.75 1.29904 0 0
M  V30 12 C -0.75 1.29904 0 0
M  V30 13 C -1.5 0 0 0
M  V30 14 C -0.75 -1.29904 0 0
M  V30 15 O 2.99178 -2.75487 0 0
M  V30 16 N 6.35787 -8.42587 0 0
M  V30 17 C 7.72819 -9.03597 0 0 CFG=2
M  V30 18 C 9.02723 -8.28597 0 0
M  V30 19 O 10.14195 -9.28967 0 0
M  V30 20 C 9.53184 -10.65999 0 0
M  V30 21 C 8.04006 -10.50319 0 0 CFG=2
M  V30 22 O 7.03636 -11.61791 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 5 4 CFG=3
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 8 7 CFG=3
M  V30 8 1 8 9
M  V30 9 2 9 10
M  V30 10 1 10 11
M  V30 11 2 11 12
M  V30 12 1 12 13
M  V30 13 2 13 14
M  V30 14 1 8 15
M  V30 15 1 2 16
M  V30 16 1 17 16 CFG=3
M  V30 17 1 17 18
M  V30 18 1 18 19
M  V30 19 1 19 20
M  V30 20 1 20 21
M  V30 21 1 21 22 CFG=3
M  V30 22 1 15 5
M  V30 23 1 21 17
M  V30 24 1 14 9
M  V30 END BOND
M  V30 BEGIN COLLECTION
M  V30 MDLV30/STERAC1 ATOMS=(2 5 8)
M  V30 MDLV30/STERAC2 ATOMS=(2 17 21)
M  V30 END COLLECTION
M  V30 END CTAB
M  END
> <Name>
2 And groups, from CXSMILES

$$$$

If I run this code:

# 4. read the same molecule from an SDF made by different software into list ms2
with Chem.SDMolSupplier('mol_with_enhanced_stereo_2_And_groups.sdf') as SDF:
    ms2 = [m for m in SDF if m is not None]

I get the error messages below, and the MolBlock is wrong (does not contain the enhanced stereochemistry information).

RDKit WARNING: [16:51:59] Skipping unrecognized collection type at line 58: MDLV30/STERAC1 ATOMS=(2 5 8)
RDKit WARNING: [16:51:59] Skipping unrecognized collection type at line 59: MDLV30/STERAC2 ATOMS=(2 17 21)
[16:51:59] Skipping unrecognized collection type at line 58: MDLV30/STERAC1 ATOMS=(2 5 8)
[16:51:59] Skipping unrecognized collection type at line 59: MDLV30/STERAC2 ATOMS=(2 17 21)

Does anybody know why this might be the case? Is there something in the V3000 format in the second file that makes RDKit not process it correctly?

I compared them side by side, and the main differences I can see are the CFG flags added to the atom block, and the name in the first line. Hard to imagine how either of these things could have an impact on the collection block, which looks identical in the two SD files.

I am using SD files made by that 'other software' in many other contexts, and they seem to be processed correctly. In fact I am also using those SD files for some work in RDKit; this test made me discover that I am losing information (the warnings often do not imply that, so I tend to ignore them, but in this case they do).

Thanks
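
One way to take the side-by-side comparison a step further is to look at the raw characters of the collection lines rather than at how they are displayed. The following is a minimal sketch using only the Python standard library. The helper names (collection_lines, show_hidden_differences, compare_files) are made up for illustration and are not part of RDKit. Whether whitespace or line endings play any role in the warnings above is not known; the sketch only makes such differences visible if they exist.

```python
from itertools import zip_longest
from pathlib import Path


def collection_lines(sdf_text):
    """Return the lines from BEGIN COLLECTION to END COLLECTION, inclusive.

    Line endings are kept so that '\\n' versus '\\r\\n' stays visible later.
    """
    selected = []
    inside = False
    for line in sdf_text.splitlines(keepends=True):
        if "BEGIN COLLECTION" in line:
            inside = True
        if inside:
            selected.append(line)
        if "END COLLECTION" in line:
            inside = False
    return selected


def show_hidden_differences(text_a, text_b):
    """Pair up the collection lines of two files and return the pairs that
    differ, as repr() strings so tabs, extra spaces and line endings show up."""
    pairs = zip_longest(collection_lines(text_a), collection_lines(text_b), fillvalue="")
    return [(repr(a), repr(b)) for a, b in pairs if a != b]


def compare_files(path_a, path_b):
    """Hypothetical convenience wrapper: read both files as bytes (so that
    line endings are not translated) and print any differing collection lines."""
    text_a = Path(path_a).read_bytes().decode("utf-8", errors="replace")
    text_b = Path(path_b).read_bytes().decode("utf-8", errors="replace")
    for a, b in show_hidden_differences(text_a, text_b):
        print(a)
        print(b)
        print()
```

With the two files from this message, the call would be compare_files('m_with_enh_stereo.sdf', 'mol_with_enhanced_stereo_2_And_groups.sdf'). If nothing is printed, the collection blocks are character-for-character identical and the difference has to be elsewhere in the file.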