Hi Markus, > On Oct 2, 2020, at 19:56, Markus Metz <metm...@gmail.com> wrote: > I have a question to the sd file format. > When I write charged molecules via rdkit I noticed that the charge definition > in the atom block is not written. > The charge is written at the end of the entry. > So far this worked perfectly fine for me.
The ctfile documentation I have from 2011 says this of the charge definition in the atom block: Wider range of values in M CHG and M RAD lines below. Retained for compatibility with older Ctabs, M CHG and M RAD lines take precedence. and With Ctab version V2000, the dd and ccc fields have been superseded by the M ISO, M CHG, and M RAD lines in the properties block, described below. For compatibility, all releases since ISIS 1.0: • Write appropriate values in both places if the values are in the old range. • Use the atom block fields if there are no M ISO, M CHG, or M RAD lines in the properties block. Support for these atom block fields might be removed in future releases of Symyx software. Further, I looked into this when I wrote the blog post http://www.dalkescientific.com/writings/diary/archive/2020/09/25/mixing_text_and_chemistry_toolkits.html a couple of week ago, and found the 1992 JCICS paper "Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited" by Dalby et al. has the "Wider range ... Retained for compatibility with older Ctabs" in it. So including the charge in the atom block as well as in the properties block is a 28+ year old backwards compatibility practice. > Now, I am using a program which reads the atom block charge info only. > Is there a way in rdkit to enable the charge written in the atom block? No. The code in Code/GraphMol/FileParsers/MolFileWriter.cpp has it hard-coded to 0. > Do you have any thoughts on this? The two I can think of are: - post-processing to add it back in, - pass it through another toolkit which adds the duplicated charge information I've attached a program for the first of these options. The command-line tools reads an SDF and generates a new SDF with the "M CHG" lines added to the atom block. Here's the --help: =================== usage: set_atom_block_charges.py [-h] [--output FILENAME] [--roundtrip] [--verify] [--no-set] [FILENAME] copy charge information from the 'M CHG' data line to the atom block positional arguments: FILENAME input filename (default: stdin) optional arguments: -h, --help show this help message and exit --output FILENAME, -o FILENAME output file name (default: stdout) --roundtrip use RDKit to parse the record and regenerate the SDF record --verify ensure the input and output SMILES match --no-set don't set the charges (useful if you want to see the round-trip output) =================== This depends on the latest commercial version chemfp to identify records in an SDF and to help with the verification. While chemfp is not open source, the base license lets you use this functionality for in-house use. (See the file for installation details; the pre-compiled package only installs on Linux-based OSes.) Or, you can grab set_atom_block_charges() from the code (and some code it depends on) so you don't need chemfp at all. In the following, I round-trip the input through RDKit but don't set the atom block charges: % python set_atom_block_charges.py piperidine.sdf --roundtrip --no-set piperidine RDKit 3D 6 6 0 0 1 0 0 0 0 0999 V2000 -1.4650 0.7843 -0.9210 N 0 0 0 0 0 0 0 0 0 0 0 0 0.0601 0.7265 -0.6801 C 0 0 0 0 0 0 0 0 0 0 0 0 0.6663 -0.3976 -1.5418 C 0 0 0 0 0 0 0 0 0 0 0 0 -0.0188 -1.7539 -1.2886 C 0 0 0 0 0 0 0 0 0 0 0 0 -1.5436 -1.6645 -1.4884 C 0 0 0 0 0 0 0 0 0 0 0 0 -2.1760 -0.5554 -0.6261 C 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 1 6 1 0 2 3 1 0 3 4 1 0 4 5 1 0 5 6 1 0 M CHG 1 1 1 M END $$$$ In the following, I round-trip it through RDKit then let the tool set the charges in the atom block. % python set_atom_block_charges.py piperidine.sdf --roundtrip piperidine RDKit 3D 6 6 0 0 1 0 0 0 0 0999 V2000 -1.4650 0.7843 -0.9210 N 0 3 0 0 0 0 0 0 0 0 0 0 0.0601 0.7265 -0.6801 C 0 0 0 0 0 0 0 0 0 0 0 0 0.6663 -0.3976 -1.5418 C 0 0 0 0 0 0 0 0 0 0 0 0 -0.0188 -1.7539 -1.2886 C 0 0 0 0 0 0 0 0 0 0 0 0 -1.5436 -1.6645 -1.4884 C 0 0 0 0 0 0 0 0 0 0 0 0 -2.1760 -0.5554 -0.6261 C 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 1 6 1 0 2 3 1 0 3 4 1 0 4 5 1 0 5 6 1 0 M CHG 1 1 1 M END $$$$ You can also use the --verify flag to generate and compare the SMILES strings before and after the conversion. Best regards, Andrew da...@dalkescientific.com
# Copy charges from the "M CHG" data lines to the atom block # Written by Andrew Dalke, 2 October 2020 import argparse import sys import gzip # This requires chemfp. See https://chemfp.com/license/ for licenses. # The default license is the "chemfp Base License" from chemfp import text_toolkit # For Linux: python -m pip install chemfp -i https://chemfp.com/packages/ # From the documentation: # 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, # 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3 _chg_to_ccc_table = { +3: 1, +2: 2, +1: 3, -1: 5, -2: 6, -3: 7, } class ChargeError(Exception): pass def set_atom_block_charges(record, verify=False): lines = record.splitlines(True) num_atoms = int(lines[3][:3]) num_bonds = int(lines[3][3:6]) for line in lines[3 + num_atoms + num_bonds:]: if line.startswith(b"M END"): break if not line.startswith(b"M CHG"): continue # Process each charge assignment on this line try: num_fields = int(line[6:9]) for field_index in range(num_fields): offset = 10 + 8*field_index atom_index = int(line[offset:offset+3]) chg = int(line[offset+4:offset+7]) # 0 means uncharged or unsupported charge ccc = _chg_to_ccc_table.get(chg, 0) # Update the correct atom line to have the charge atom_line = lines[3 + atom_index] atom_line = b"%s%3d%s" % (atom_line[:36], ccc, atom_line[39:]) lines[3 + atom_index] = atom_line except (ValueError, IndexError) as err: # This shouldn't happen unless I did something wrong. raise ChargeError("Unable to process charges in %r" % (line,)) if verify: # Remove the "M CHG" line i = lines.index(line) lines[i] = b"" return b"".join(lines) ### Handle command-line processing parser = argparse.ArgumentParser( description="copy charge information from the 'M CHG' data line to the atom block") parser.add_argument("--output", "-o", metavar="FILENAME", help="output file name (default: stdout)") parser.add_argument("--roundtrip", action="store_true", help="use RDKit to parse the record and regenerate the SDF record") parser.add_argument("--verify", action="store_true", help="ensure the input and output SMILES match") parser.add_argument("--no-set", action="store_true", help="don't set the charges (useful if you want to see the round-trip output)") parser.add_argument("input", nargs="?", metavar="FILENAME", help="input filename (default: stdin)") def open_binary_output(filename): if filename is None: return getattr(sys.stdout, "buffer", sys.stdout) # support Python 2 and 3 elif filename.endswith(".gz"): return gzip.open(filename, "wb") else: return open(filename, "wb") def main(): args = parser.parse_args() verify = args.verify roundtrip = args.roundtrip no_set = args.no_set if verify or roundtrip: from chemfp import rdkit_toolkit as T with text_toolkit.read_sdf_records(args.input) as record_reader: outfile = open_binary_output(args.output) for record in record_reader: if roundtrip: mol = T.parse_molecule(record, "sdf", errors="ignore") if mol is None: sys.stderr.write("Warning: Could not process molecule record: %s -- skipping\n" % ( record_reader.location.where(),)) continue record = T.create_bytes(mol, "sdf") if verify: # Get a reference SMILES string smi1 = T.create_string(T.parse_molecule(record, "sdf"), "smistring") if no_set: output = record else: try: output = set_atom_block_charges(record, verify) except ChargeError as err: sys.stderr.write("ERROR: %s: %s\n" % (err, record_reader.location.where())) raise SystemExit("Exiting.") if verify: # Ensure the input and output SMILES strings match smi2 = T.create_string(T.parse_molecule(output, "sdf"), "smistring") if smi1 != smi2: sys.stderr.write("Could not adjust charges correctly: %s\n" % ( record_reader.location.where())) sys.stderr.write("Original SMILES: %r\n" % (smi1,)) sys.stderr.write("Changed SMILES : %r\n" % (smi2,)) sys.stderr.write("Skipping.\n") continue outfile.write(output) if __name__ == "__main__": import signal signal.signal(signal.SIGINT, signal.SIG_DFL) # Allow ^C to kill the process. signal.signal(signal.SIGPIPE, signal.SIG_DFL) # Allow the output pipe to be closed main()
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss