Re: [Rdkit-discuss] sd file format question

Andrew Dalke Fri, 02 Oct 2020 14:26:47 -0700

Hi Markus,

> On Oct 2, 2020, at 19:56, Markus Metz <metm...@gmail.com> wrote:
> I have a question to the sd file format.
> When I write charged molecules via rdkit I noticed that the charge definition 
> in the atom block is not written.
> The charge is written at the end of the entry.
> So far this worked perfectly fine for me.



The ctfile documentation I have from 2011 says this of the charge definition in 
the atom block:

   Wider range of values in M CHG and M RAD lines below. Retained
   for compatibility with older Ctabs, M CHG and M RAD lines take
   precedence.

and

   With Ctab version V2000, the dd and ccc fields have been
   superseded by the M ISO, M CHG, and M RAD lines in the properties
   block, described below. For compatibility, all releases since ISIS 1.0:

    • Write appropriate values in both places if the values
      are in the old range.

    • Use the atom block fields if there are no M  ISO, M  CHG, or
      M  RAD lines in the properties block.

   Support for these atom block fields might be removed in future
   releases of Symyx software.

Further, I looked into this when I wrote the blog post 
http://www.dalkescientific.com/writings/diary/archive/2020/09/25/mixing_text_and_chemistry_toolkits.html
 a couple of week ago, and found the 1992 JCICS paper "Description of Several 
Chemical Structure File Formats Used by Computer Programs Developed at 
Molecular Design Limited" by Dalby et al. has the "Wider range ... Retained for 
compatibility with older Ctabs" in it.

So including the charge in the atom block as well as in the properties block is 
a 28+ year old backwards compatibility practice.


> Now, I am using a program which reads the atom block charge info only.
> Is there a way in rdkit to enable the charge written in the atom block?

No. The code in Code/GraphMol/FileParsers/MolFileWriter.cpp has it hard-coded 
to 0.

> Do you have any thoughts on this?

The two I can think of are:
  - post-processing to add it back in,
  - pass it through another toolkit which adds the duplicated charge information


I've attached a program for the first of these options. The command-line tools 
reads an SDF and generates a new SDF with the "M  CHG" lines added to the atom 
block. Here's the --help:

===================
usage: set_atom_block_charges.py [-h] [--output FILENAME] [--roundtrip] 
[--verify] [--no-set] [FILENAME]

copy charge information from the 'M CHG' data line to the atom block

positional arguments:
  FILENAME              input filename (default: stdin)

optional arguments:
  -h, --help            show this help message and exit
  --output FILENAME, -o FILENAME
                        output file name (default: stdout)
  --roundtrip           use RDKit to parse the record and regenerate the SDF 
record
  --verify              ensure the input and output SMILES match
  --no-set              don't set the charges (useful if you want to see the 
round-trip output)
===================

This depends on the latest commercial version chemfp to identify records in an 
SDF and to help with the verification.

While chemfp is not open source, the base license lets you use this 
functionality for in-house use. (See the file for installation details; the 
pre-compiled package only installs on Linux-based OSes.)

Or, you can grab set_atom_block_charges() from the code (and some code it 
depends on) so you don't need chemfp at all.

In the following, I round-trip the input through RDKit but don't set the atom 
block charges:

% python set_atom_block_charges.py piperidine.sdf --roundtrip --no-set
piperidine
     RDKit          3D

  6  6  0  0  1  0  0  0  0  0999 V2000
   -1.4650    0.7843   -0.9210 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0601    0.7265   -0.6801 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6663   -0.3976   -1.5418 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0188   -1.7539   -1.2886 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5436   -1.6645   -1.4884 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.1760   -0.5554   -0.6261 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  6  1  0
  2  3  1  0
  3  4  1  0
  4  5  1  0
  5  6  1  0
M  CHG  1   1   1
M  END
$$$$

In the following, I round-trip it through RDKit then let the tool set the 
charges in the atom block.

% python set_atom_block_charges.py piperidine.sdf --roundtrip
piperidine
     RDKit          3D

  6  6  0  0  1  0  0  0  0  0999 V2000
   -1.4650    0.7843   -0.9210 N   0  3  0  0  0  0  0  0  0  0  0  0
    0.0601    0.7265   -0.6801 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6663   -0.3976   -1.5418 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0188   -1.7539   -1.2886 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5436   -1.6645   -1.4884 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.1760   -0.5554   -0.6261 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  6  1  0
  2  3  1  0
  3  4  1  0
  4  5  1  0
  5  6  1  0
M  CHG  1   1   1
M  END
$$$$

You can also use the --verify flag to generate and compare the SMILES strings 
before and after the conversion.

Best regards,


                                Andrew
                                da...@dalkescientific.com

# Copy charges from the "M  CHG" data lines to the atom block
# Written by Andrew Dalke, 2 October 2020
import argparse
import sys
import gzip

# This requires chemfp. See https://chemfp.com/license/ for licenses.
# The default license is the "chemfp Base License"
from chemfp import text_toolkit # For Linux: python -m pip install chemfp -i https://chemfp.com/packages/

# From the documentation:
#   0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1,
#   4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
_chg_to_ccc_table = {
    +3: 1,
    +2: 2,
    +1: 3,
    -1: 5,
    -2: 6,
    -3: 7,
    }

class ChargeError(Exception):
    pass
    
def set_atom_block_charges(record, verify=False):
    lines = record.splitlines(True)
    num_atoms = int(lines[3][:3])
    num_bonds = int(lines[3][3:6])

    for line in lines[3 + num_atoms + num_bonds:]:
        if line.startswith(b"M  END"):
            break

        if not line.startswith(b"M  CHG"):
            continue

        # Process each charge assignment on this line
        try:
            num_fields = int(line[6:9])

            for field_index in range(num_fields):
                offset = 10 + 8*field_index
                atom_index = int(line[offset:offset+3])
                chg = int(line[offset+4:offset+7])

                # 0 means uncharged or unsupported charge
                ccc = _chg_to_ccc_table.get(chg, 0)

                # Update the correct atom line to have the charge
                atom_line = lines[3 + atom_index]
                atom_line = b"%s%3d%s" % (atom_line[:36], ccc, atom_line[39:])
                lines[3 + atom_index] = atom_line

        except (ValueError, IndexError) as err:
            # This shouldn't happen unless I did something wrong.
            raise ChargeError("Unable to process charges in %r" % (line,))
        
        if verify:
            # Remove the "M  CHG" line
            i = lines.index(line)
            lines[i] = b""

    return b"".join(lines)

### Handle command-line processing

parser = argparse.ArgumentParser(
    description="copy charge information from the 'M  CHG' data line to the atom block")
parser.add_argument("--output", "-o", metavar="FILENAME",
                        help="output file name (default: stdout)")
parser.add_argument("--roundtrip", action="store_true",
                        help="use RDKit to parse the record and regenerate the SDF record")
parser.add_argument("--verify", action="store_true",
                        help="ensure the input and output SMILES match")
parser.add_argument("--no-set", action="store_true",
                        help="don't set the charges (useful if you want to see the round-trip output)")
parser.add_argument("input", nargs="?", metavar="FILENAME",
                        help="input filename (default: stdin)")


def open_binary_output(filename):
    if filename is None:
        return getattr(sys.stdout, "buffer", sys.stdout) # support Python 2 and 3
    elif filename.endswith(".gz"):
        return gzip.open(filename, "wb")
    else:
        return open(filename, "wb")

def main():
    args = parser.parse_args()

    verify = args.verify
    roundtrip = args.roundtrip
    no_set = args.no_set
    
    if verify or roundtrip:
        from chemfp import rdkit_toolkit as T
        

    with text_toolkit.read_sdf_records(args.input) as record_reader:
        outfile = open_binary_output(args.output)
        
        for record in record_reader:
            if roundtrip:
                mol = T.parse_molecule(record, "sdf", errors="ignore")
                if mol is None:
                    sys.stderr.write("Warning: Could not process molecule record: %s -- skipping\n" % (
                        record_reader.location.where(),))
                    continue
                record = T.create_bytes(mol, "sdf")
                
            if verify:
                # Get a reference SMILES string
                smi1 = T.create_string(T.parse_molecule(record, "sdf"), "smistring")

            if no_set:
                output = record
            else:
                try:
                    output = set_atom_block_charges(record, verify)
                except ChargeError as err:
                    sys.stderr.write("ERROR: %s: %s\n" % (err, record_reader.location.where()))
                    raise SystemExit("Exiting.")
            
            if verify:
                # Ensure the input and output SMILES strings match
                smi2 = T.create_string(T.parse_molecule(output, "sdf"), "smistring")
                if smi1 != smi2:
                    sys.stderr.write("Could not adjust charges correctly: %s\n" % (
                        record_reader.location.where()))
                    sys.stderr.write("Original SMILES: %r\n" % (smi1,))
                    sys.stderr.write("Changed SMILES : %r\n" % (smi2,))
                    sys.stderr.write("Skipping.\n")
                    continue
            
            outfile.write(output)
        
if __name__ == "__main__":
    import signal
    signal.signal(signal.SIGINT, signal.SIG_DFL)  # Allow ^C to kill the process.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL) # Allow the output pipe to be closed
    main()

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] sd file format question

Reply via email to