> On Oct 21, 2021, at 04:50, Ling Chan <lingtrek...@gmail.com> wrote:
> 
> I got the attached sdf. When I did a MolToSmiles, it gives me the following.
> 
> >>> for m in Chem.SDMolSupplier("pdb_structures/1q6k_ligand.sdf"):
> ...   print (Chem.MolToSmiles(m))
> ... 
> [CH3:0][C:0]([CH3:0])([CH3:0])[O:0][C:0](=[O:0])[NH:0][CH:0]([CH:0]=[O:0])[CH:0]1[CH2:0][CH2:0][CH2:0][CH2:0][CH2:0]1
> 
> Just wonder why does it not give something like
> O=C(OC(C)(C)C)NC(C=O)C1CCCCC1

The terms after the atom symbol in your atom block lines are center-justified 
(or left-justified, in the 2-digit mass difference term 'dd') instead of 
right-justified.

Here is your first atom line, followed by the field template from the ctfile 
spec, and then the same line after a round-trip through RDKit:

   74.0060   -9.5770  134.8660 N  0  0  0  0  0  0  0  0  0  0  0  0    <-- yours
xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee   <-- spec
   74.0060   -9.5770  134.8660 N   0  0  0  0  0  0  0  0  0  0  0  0   <-- RDKit

Add a space after the atom symbol field ("aaa") and everything works.
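
To see the misalignment field by field, the atom line can be cut at the column 
boundaries given by the spec template. This is a minimal sketch in plain 
Python; the names FIELDS, split_atom_line, yours and fixed are made up for 
illustration and are not part of RDKit.

```python
# Field names and widths that follow the three 10-character coordinates and
# the single blank column, read off the spec template above.
FIELDS = [("aaa", 3), ("dd", 2), ("ccc", 3), ("sss", 3), ("hhh", 3),
          ("bbb", 3), ("vvv", 3), ("HHH", 3), ("rrr", 3), ("iii", 3),
          ("mmm", 3), ("nnn", 3), ("eee", 3)]


def split_atom_line(line):
    """Cut a V2000 atom line into its fixed-width fields (illustrative helper)."""
    fields = {}
    start = 31  # 0-based offset of the atom symbol field
    for name, width in FIELDS:
        fields[name] = line[start:start + width]
        start += width
    return fields


# Rebuild your first atom line: coordinates, " N", then twelve "  0" terms.
coords = f"{74.006:10.4f}{-9.577:10.4f}{134.866:10.4f}"
yours = coords + " N" + "  0" * 12

# The fix: one extra space right after the atom symbol field.
fixed = yours[:34] + " " + yours[34:]

for label, line in (("yours", yours), ("fixed", fixed)):
    f = split_atom_line(line)
    print(label, len(line), repr(f["dd"]), repr(f["mmm"]), repr(f["eee"]))
```

In your line the "dd" term comes out left-justified and "mmm" center-justified; 
with the extra space every integer field ends in its digit.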

What happened?

The ":0" in the SMILES string derives from the atom-atom mapping number, "mmm", 
in the SDF.

The relevant code from 
Code/GraphMol/FileParsers/MolFileParser.cpp::ParseMolFileAtomLine() is:

  if (text.size() >= 63 && text.substr(60, 3) != "  0") {
    int atomMapNumber = 0;
    try {
      atomMapNumber = FileParserUtils::toInt(text.substr(60, 3), true);
    } catch (boost::bad_lexical_cast &) {
      std::ostringstream errout;
      errout << "Cannot convert '" << text.substr(60, 3) << "' to int on line "
             << line;
      delete res;
      throw FileParseException(errout.str());
    }
    res->setProp(common_properties::molAtomMapNumber, atomMapNumber);
  }

This says that if the field isn't exactly "  0" then parse it as an integer and 
store it in the atom's molAtomMapNumber.

Since your " 0 " field isn't exactly "  0", it gets parsed as the integer 0 and 
stored as an explicit atom map number of 0, which MolToSmiles then writes out 
as ":0".
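
The same test is easy to mimic outside of RDKit. What follows is a rough 
Python paraphrase of the C++ check quoted above, not RDKit's own code. The 
name atom_map_from_line is hypothetical, and Python's int() stands in for 
FileParserUtils::toInt(), on the assumption that both tolerate the 
surrounding spaces.

```python
def atom_map_from_line(text):
    """Return the atom map number the quoted check would store, or None."""
    field = text[60:63]
    if len(text) >= 63 and field != "  0":
        return int(field)  # raises ValueError if the field is not an integer
    return None  # field is exactly "  0" (or absent): no property is set


# Your first atom line, and the same line with one space added after "N".
coords = f"{74.006:10.4f}{-9.577:10.4f}{134.866:10.4f}"
yours = coords + " N" + "  0" * 12
fixed = coords + " N " + "  0" * 12

print("yours:", atom_map_from_line(yours))
print("fixed:", atom_map_from_line(fixed))
```

The distinction that matters is between "no atom map property" (None here) and 
"atom map property set to 0".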

I don't see an explicit statement in the spec about alignment in fields. It's 
clear the spec comes from a Fortran background, so these should be interpreted 
as "I2" and "I3", and right-justified.
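
For anyone writing these lines by hand, Python's "d" format with a width gives 
right-justified integer fields, which is the analogue of Fortran's I2 and I3. 
A small sketch follows; format_atom_line is a hypothetical helper, and setting 
every field after the symbol to 0 is only for this example.

```python
def format_atom_line(x, y, z, symbol):
    """Write a V2000 atom line with all fields after the symbol set to 0."""
    coords = f"{x:10.4f}{y:10.4f}{z:10.4f}"
    aaa = f"{symbol:<3}"  # atom symbol, left-justified in 3 columns
    dd = f"{0:2d}"        # mass difference, right-justified like Fortran I2
    # ccc through eee: eleven fields, right-justified like Fortran I3
    rest = "".join(f"{0:3d}" for _ in range(11))
    return f"{coords} {aaa}{dd}{rest}"


line = format_atom_line(74.006, -9.577, 134.866, "N")
print(repr(line))
print(len(line), repr(line[60:63]))
```

The result has the same layout as the RDKit round-trip line in the comparison 
above.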


By the way, if you pass your file through CDK you get:

org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 5:    74.0060   -9.5770  134.8660 N  0  0  0  0  0  0  0  0  0  0  0  0   -> invalid line length, 68:    74.0060   -9.5770  134.8660 N  0  0  0  0  0  0  0  0  0  0  0  0
org.openscience.cdk.io.iterator.IteratingSDFReader ERROR: Error while reading next molecule: invalid line length, 68:    74.0060   -9.5770  134.8660 N  0  0  0  0  0  0  0  0  0  0  0  0

CDK's 
storage/ctab/src/main/java/org/openscience/cdk/io/MDLV2000Reader.java::readAtomFast() 
requires that each field either be present in full or be cut off entirely by 
the end of the line. Your line is 68 characters long because your last field 
is the two-character " 0" instead of the three-character " 0 " needed to fill 
the exact change flag field, "eee".

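As a rough model of that rule (my reading of the behaviour described above, 
not a translation of CDK's Java), a line length is acceptable only if it lands 
on the end of a field. FIELD_ENDS and length_ok are names invented for this 
sketch, and treating the end of the atom symbol field as the shortest allowed 
line is an assumption.

```python
# Column at which each field ends: "aaa" ends at 34, "dd" at 36, and then
# every three columns from "ccc" at 39 up to "eee" at 69.
FIELD_ENDS = [34, 36] + list(range(39, 70, 3))


def length_ok(line):
    """True if the line stops exactly at the end of some field."""
    return len(line) in FIELD_ENDS


# Your first atom line: coordinates, " N", then twelve "  0" terms.
coords = f"{74.006:10.4f}{-9.577:10.4f}{134.866:10.4f}"
yours = coords + " N" + "  0" * 12

print(len(yours), length_ok(yours))
print(len(yours + " "), length_ok(yours + " "))
```

Padding your line to 69 characters would satisfy the length rule, but the 
fields would still be misaligned, so the extra space after the atom symbol is 
the real fix.
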
Best regards,


                                Andrew
                                da...@dalkescientific.com



