Hi Rudy,
> On Feb 27, 2022, at 20:55, Rudy Richardson <rjr...@umich.edu> wrote:
>
> I have a library of ~1000 compounds as SMILES strings with an appended name
> code and a property. For example:
>
> c1ccc(c2ccccc2)cc1 0001 -2.52
>
> Where "0001" is the name code and "-2.52" is a physicochemical property of
> the molecule.
>
> I would like to convert these strings to a concatenated SDF file,

If you're comfortable working with Python, here's an example using
the pybel interface.

First, here's how to get the name code and property:

>>> from openbabel import pybel
>>> mol = pybel.readstring("smi", "c1ccc(c2ccccc2)cc1\t0001\t-2.52")
>>> mol
<openbabel.pybel.Molecule object at 0x1101dece0>
>>> mol.title
'0001\t-2.52'

In that case I used tabs (represented as "\t"), because I believe
that's what's in your file. That would explain the extra space
between the fields.

I'll use Python's str.split() to split on any whitespace (which
includes both spaces and tabs)

>>> mol.title.split()
['0001', '-2.52']

and assign them to the variables "name_code" and "value".

>>> name_code, value = mol.title.split()

The pybel API has a "write()" method on molecules which formats the
molecule into a given format. Here's what it looks like in "sdf".

>>> print(mol.write("sdf"))
0001 -2.52
 OpenBabel02272221522D

 12 13  0  0  0  0  0  0  0  0999 V2000
  ...

I need to change the title and add a "logX" data item, which I can
do with:

>>> mol.title = name_code
>>> mol.data["logX"] = value

giving most of what you wanted.
>>> print(mol.write("sdf"))
0001
 OpenBabel02272221552D

 12 13  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1 12  2  0  0  0  0
  1  2  1  0  0  0  0
  2  3  2  0  0  0  0
  3  4  1  0  0  0  0
  4  5  1  0  0  0  0
  4 11  2  0  0  0  0
  5 10  2  0  0  0  0
  5  6  1  0  0  0  0
  6  7  2  0  0  0  0
  7  8  1  0  0  0  0
  8  9  2  0  0  0  0
  9 10  1  0  0  0  0
 11 12  1  0  0  0  0
M  END
>  <logX>
-2.52

$$$$

You also had a "No." field. I don't know if that is the index of the
input record, or the integer value of the name_code, like:

>>> int(name_code)
1

I'll assume it's the input index.

I'll use Python's built-in "enumerate()" function. What it does is
add an index to each element of an iterable. For example, I can
iterate through the characters of "ABCD" like this:

>>> for c in "ABCD":
...     print(c)
...
A
B
C
D

For each X in the input iterable, enumerate() yields (index, X):

>>> for i, c in enumerate("ABCD"):
...     print(i, c)
...
0 A
1 B
2 C
3 D

I can also specify the initial index, for example, to start at 1:

>>> for i, c in enumerate("ABCD", 1):
...     print(i, c)
...
1 A
2 B
3 C
4 D

The last bit to know is that pybel's "readfile" gives a way to
iterate over all molecules in a file.

>>> from openbabel import pybel
>>> for i, mol in enumerate(pybel.readfile("smi", "wikipedia2.smi"), 1):
...     print("Entry#:", i, repr(mol.title))
...     if i == 10:
...         break
...
Entry#: 1 'Ammonia'
Entry#: 2 'Aspirin'
Entry#: 3 'Acetylene'
Entry#: 4 'Adenosine triphosphate'
Entry#: 5 'Ampicillin'
Entry#: 6 'Ascorbic acid'
Entry#: 7 'Ascorbic acid'
Entry#: 8 'Amphetamine'
Entry#: 9 'Aspartame'
Entry#: 10 'Amoxicillin'

Finally, all the coordinates were 0.0. To make things a bit nicer,
use the "make2D()" or "make3D()" methods to add 2D or 3D coordinates,
respectively. Your example uses 3D, so I'll do that.

Putting it all together, along with some use of Python's "argparse"
module to handle command-line processing (which I won't discuss here),
gives the "rjrich.py" program, attached.

It's used like this:

% python rjrich.py test.smi

You can also change the output tag and the output file name, like this:

% python rjrich.py test.smi --tag Cacao2 -o cacao2.sdf

(I believe if you're using Open Babel under Windows, with Python
installed, then you should use "py" instead of "python" to run the
program.)

Cheers,

    Andrew
    da...@dalkescientific.com

import sys
import argparse
from openbabel import pybel

# Use the "argparse" module to handle command-line argument processing
parser = argparse.ArgumentParser(
    description = "convert SMILES with name and data value to SDF"
    )
parser.add_argument("--tag", default = "logX",
                    help = "SDF data tag to store the value")
parser.add_argument("--output", "-o",
                    help = "output SDF filename (default: stdout)")
parser.add_argument("filename")

def main():
    args = parser.parse_args()

    # Open the SMILES file for reading
    mol_reader = pybel.readfile("smi", args.filename)

    # Figure out where to write the output
    if args.output is None:
        output_file = sys.stdout
    else:
        output_file = open(args.output, "w")

    # Process each molecule
    for mol_no, mol in enumerate(mol_reader, 1):
        # The title looks like "0001 -2.52" with the name_code
        # followed by whitespace followed by the value
        name_code, value = mol.title.split()

        # Update the title to have just the name_code
        mol.title = name_code

        # Add new data items
        mol.data["No."] = mol_no
        mol.data[args.tag] = value

        # Generate 3D coordinates
        mol.make3D()

        # Write the result
        output_file.write(mol.write("sdf"))

    # Close the output file if this program opened it
    if output_file is not sys.stdout:
        output_file.close()

# The standard Python way to recognize this is being run as a
# command-line program.
if __name__ == "__main__":
    main()
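
A note on the title-splitting step in the program above: the line
"name_code, value = mol.title.split()" raises ValueError if any record's
title does not contain exactly two whitespace-separated fields (for
example, a missing property value). Below is a minimal sketch of a more
defensive parse, using only plain Python. The names "parse_title" and
"parse_titles" are hypothetical helpers for illustration and are not part
of rjrich.py or of pybel.

```python
def parse_title(title):
    """Split a title like '0001\t-2.52' into (name_code, value).

    Hypothetical helper, not part of rjrich.py. Returns None when the
    title does not contain exactly two whitespace-separated fields.
    """
    fields = title.split()
    if len(fields) != 2:
        return None
    name_code, value = fields
    return name_code, value


def parse_titles(titles):
    """Yield (record_no, name_code, value) for each well-formed title.

    Record numbers start at 1 and count every input title, so a skipped
    record leaves a gap in the numbering instead of shifting later ones.
    """
    for record_no, title in enumerate(titles, 1):
        parsed = parse_title(title)
        if parsed is None:
            # Malformed record: skip it rather than stop the whole run
            continue
        name_code, value = parsed
        yield record_no, name_code, value
```

In the attached program, the same check could be applied to mol.title
inside the loop, perhaps reporting the skipped record number on
sys.stderr. Whether to skip or to stop on a bad record is a design
choice that depends on how much you trust the input file.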