Hi Markus,
> On Oct 2, 2020, at 19:56, Markus Metz <[email protected]> wrote:
> I have a question to the sd file format.
> When I write charged molecules via rdkit I noticed that the charge definition
> in the atom block is not written.
> The charge is written at the end of the entry.
> So far this worked perfectly fine for me.
The ctfile documentation I have from 2011 says this of the charge definition in
the atom block:
Wider range of values in M CHG and M RAD lines below. Retained
for compatibility with older Ctabs, M CHG and M RAD lines take
precedence.
and
With Ctab version V2000, the dd and ccc fields have been
superseded by the M ISO, M CHG, and M RAD lines in the properties
block, described below. For compatibility, all releases since ISIS 1.0:
• Write appropriate values in both places if the values
are in the old range.
• Use the atom block fields if there are no M ISO, M CHG, or
M RAD lines in the properties block.
Support for these atom block fields might be removed in future
releases of Symyx software.
Further, I looked into this when I wrote the blog post
http://www.dalkescientific.com/writings/diary/archive/2020/09/25/mixing_text_and_chemistry_toolkits.html
a couple of week ago, and found the 1992 JCICS paper "Description of Several
Chemical Structure File Formats Used by Computer Programs Developed at
Molecular Design Limited" by Dalby et al. has the "Wider range ... Retained for
compatibility with older Ctabs" in it.
So including the charge in the atom block as well as in the properties block is
a 28+ year old backwards compatibility practice.
> Now, I am using a program which reads the atom block charge info only.
> Is there a way in rdkit to enable the charge written in the atom block?
No. The code in Code/GraphMol/FileParsers/MolFileWriter.cpp has it hard-coded
to 0.
> Do you have any thoughts on this?
The two I can think of are:
- post-processing to add it back in,
- pass it through another toolkit which adds the duplicated charge information
I've attached a program for the first of these options. The command-line tools
reads an SDF and generates a new SDF with the "M CHG" lines added to the atom
block. Here's the --help:
===================
usage: set_atom_block_charges.py [-h] [--output FILENAME] [--roundtrip]
[--verify] [--no-set] [FILENAME]
copy charge information from the 'M CHG' data line to the atom block
positional arguments:
FILENAME input filename (default: stdin)
optional arguments:
-h, --help show this help message and exit
--output FILENAME, -o FILENAME
output file name (default: stdout)
--roundtrip use RDKit to parse the record and regenerate the SDF
record
--verify ensure the input and output SMILES match
--no-set don't set the charges (useful if you want to see the
round-trip output)
===================
This depends on the latest commercial version chemfp to identify records in an
SDF and to help with the verification.
While chemfp is not open source, the base license lets you use this
functionality for in-house use. (See the file for installation details; the
pre-compiled package only installs on Linux-based OSes.)
Or, you can grab set_atom_block_charges() from the code (and some code it
depends on) so you don't need chemfp at all.
In the following, I round-trip the input through RDKit but don't set the atom
block charges:
% python set_atom_block_charges.py piperidine.sdf --roundtrip --no-set
piperidine
RDKit 3D
6 6 0 0 1 0 0 0 0 0999 V2000
-1.4650 0.7843 -0.9210 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0601 0.7265 -0.6801 C 0 0 0 0 0 0 0 0 0 0 0 0
0.6663 -0.3976 -1.5418 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0188 -1.7539 -1.2886 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.5436 -1.6645 -1.4884 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.1760 -0.5554 -0.6261 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
1 6 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 1 0
M CHG 1 1 1
M END
$$$$
In the following, I round-trip it through RDKit then let the tool set the
charges in the atom block.
% python set_atom_block_charges.py piperidine.sdf --roundtrip
piperidine
RDKit 3D
6 6 0 0 1 0 0 0 0 0999 V2000
-1.4650 0.7843 -0.9210 N 0 3 0 0 0 0 0 0 0 0 0 0
0.0601 0.7265 -0.6801 C 0 0 0 0 0 0 0 0 0 0 0 0
0.6663 -0.3976 -1.5418 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0188 -1.7539 -1.2886 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.5436 -1.6645 -1.4884 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.1760 -0.5554 -0.6261 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
1 6 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 1 0
M CHG 1 1 1
M END
$$$$
You can also use the --verify flag to generate and compare the SMILES strings
before and after the conversion.
Best regards,
Andrew
[email protected]
# Copy charges from the "M CHG" data lines to the atom block
# Written by Andrew Dalke, 2 October 2020
import argparse
import sys
import gzip
# This requires chemfp. See https://chemfp.com/license/ for licenses.
# The default license is the "chemfp Base License"
from chemfp import text_toolkit # For Linux: python -m pip install chemfp -i https://chemfp.com/packages/
# From the documentation:
# 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1,
# 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
_chg_to_ccc_table = {
+3: 1,
+2: 2,
+1: 3,
-1: 5,
-2: 6,
-3: 7,
}
class ChargeError(Exception):
pass
def set_atom_block_charges(record, verify=False):
lines = record.splitlines(True)
num_atoms = int(lines[3][:3])
num_bonds = int(lines[3][3:6])
for line in lines[3 + num_atoms + num_bonds:]:
if line.startswith(b"M END"):
break
if not line.startswith(b"M CHG"):
continue
# Process each charge assignment on this line
try:
num_fields = int(line[6:9])
for field_index in range(num_fields):
offset = 10 + 8*field_index
atom_index = int(line[offset:offset+3])
chg = int(line[offset+4:offset+7])
# 0 means uncharged or unsupported charge
ccc = _chg_to_ccc_table.get(chg, 0)
# Update the correct atom line to have the charge
atom_line = lines[3 + atom_index]
atom_line = b"%s%3d%s" % (atom_line[:36], ccc, atom_line[39:])
lines[3 + atom_index] = atom_line
except (ValueError, IndexError) as err:
# This shouldn't happen unless I did something wrong.
raise ChargeError("Unable to process charges in %r" % (line,))
if verify:
# Remove the "M CHG" line
i = lines.index(line)
lines[i] = b""
return b"".join(lines)
### Handle command-line processing
parser = argparse.ArgumentParser(
description="copy charge information from the 'M CHG' data line to the atom block")
parser.add_argument("--output", "-o", metavar="FILENAME",
help="output file name (default: stdout)")
parser.add_argument("--roundtrip", action="store_true",
help="use RDKit to parse the record and regenerate the SDF record")
parser.add_argument("--verify", action="store_true",
help="ensure the input and output SMILES match")
parser.add_argument("--no-set", action="store_true",
help="don't set the charges (useful if you want to see the round-trip output)")
parser.add_argument("input", nargs="?", metavar="FILENAME",
help="input filename (default: stdin)")
def open_binary_output(filename):
if filename is None:
return getattr(sys.stdout, "buffer", sys.stdout) # support Python 2 and 3
elif filename.endswith(".gz"):
return gzip.open(filename, "wb")
else:
return open(filename, "wb")
def main():
args = parser.parse_args()
verify = args.verify
roundtrip = args.roundtrip
no_set = args.no_set
if verify or roundtrip:
from chemfp import rdkit_toolkit as T
with text_toolkit.read_sdf_records(args.input) as record_reader:
outfile = open_binary_output(args.output)
for record in record_reader:
if roundtrip:
mol = T.parse_molecule(record, "sdf", errors="ignore")
if mol is None:
sys.stderr.write("Warning: Could not process molecule record: %s -- skipping\n" % (
record_reader.location.where(),))
continue
record = T.create_bytes(mol, "sdf")
if verify:
# Get a reference SMILES string
smi1 = T.create_string(T.parse_molecule(record, "sdf"), "smistring")
if no_set:
output = record
else:
try:
output = set_atom_block_charges(record, verify)
except ChargeError as err:
sys.stderr.write("ERROR: %s: %s\n" % (err, record_reader.location.where()))
raise SystemExit("Exiting.")
if verify:
# Ensure the input and output SMILES strings match
smi2 = T.create_string(T.parse_molecule(output, "sdf"), "smistring")
if smi1 != smi2:
sys.stderr.write("Could not adjust charges correctly: %s\n" % (
record_reader.location.where()))
sys.stderr.write("Original SMILES: %r\n" % (smi1,))
sys.stderr.write("Changed SMILES : %r\n" % (smi2,))
sys.stderr.write("Skipping.\n")
continue
outfile.write(output)
if __name__ == "__main__":
import signal
signal.signal(signal.SIGINT, signal.SIG_DFL) # Allow ^C to kill the process.
signal.signal(signal.SIGPIPE, signal.SIG_DFL) # Allow the output pipe to be closed
main()
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss