Hi Rudy,


> On Feb 27, 2022, at 20:55, Rudy Richardson <rjr...@umich.edu> wrote:
> 
> I have a library of ~1000 compounds as SMILES strings with an appended name 
> code and a property. For example:
> 
> c1ccc(c2ccccc2)cc1    0001    -2.52
> 
> Where "0001" is the name code and "-2.52" is a physicochemical property of 
> the molecule.
> 
> I would like to convert these strings to a concatenated SDF file,


If you're comfortable working with Python, here's an example using the pybel 
interface.

First, here's how to get the name code and property:

>>> from openbabel import pybel
>>> mol = pybel.readstring("smi", "c1ccc(c2ccccc2)cc1\t0001\t-2.52")
>>> mol
<openbabel.pybel.Molecule object at 0x1101dece0>
>>> mol.title
'0001\t-2.52'

In that case I used tabs (represented as "\t"), because I believe that's what's 
in your file. That would explain the extra space between the fields.

I'll use Python's str.split() method to split on any whitespace (which includes 
both spaces and tabs)

>>> mol.title.split()
['0001', '-2.52']

and assign them to the variables "name_code" and "value".

>>> name_code, value = mol.title.split()
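If some titles in the real file might not have exactly two fields, the bare unpacking above fails with a fairly terse error. Here is a minimal sketch of a stricter version. The helper name "split_title" is my own invention (it is not part of pybel), and I'm assuming the second field should always parse as a number.

```python
def split_title(title):
    """Split a title like '0001\t-2.52' into (name_code, value) strings.

    Hypothetical helper, not part of pybel. Assumes exactly two
    whitespace-separated fields, the second of which is a number.
    """
    fields = title.split()  # splits on any run of spaces and/or tabs
    if len(fields) != 2:
        raise ValueError(
            f"expected 2 fields in title, found {len(fields)}: {title!r}")
    name_code, value = fields
    float(value)  # raises ValueError if the value is not numeric
    return name_code, value


name_code, value = split_title("0001\t-2.52")
print(name_code, value)
```

The value stays a string, so it is written to the output exactly as it appeared in the input.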

Pybel molecules have a "write()" method, which returns the molecule as a string 
in the given format. Here's what it looks like in "sdf".

>>> print(mol.write("sdf"))
0001    -2.52
 OpenBabel02272221522D

 12 13  0  0  0  0  0  0  0  0999 V2000
 ...

I need to change the title and add a "logX" data item, which I can do with:

>>> mol.title = name_code
>>> mol.data["logX"] = value

giving most of what you wanted.

>>> print(mol.write("sdf"))
0001
 OpenBabel02272221552D

 12 13  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1 12  2  0  0  0  0
  1  2  1  0  0  0  0
  2  3  2  0  0  0  0
  3  4  1  0  0  0  0
  4  5  1  0  0  0  0
  4 11  2  0  0  0  0
  5 10  2  0  0  0  0
  5  6  1  0  0  0  0
  6  7  2  0  0  0  0
  7  8  1  0  0  0  0
  8  9  2  0  0  0  0
  9 10  1  0  0  0  0
 11 12  1  0  0  0  0
M  END
>  <logX>
-2.52

$$$$
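If you want to double-check which data items ended up in a record without opening it in another program, something like the following rough sketch may help. It uses only plain Python. The function name "sd_data_items" is made up for this example, and it assumes the simple layout shown above: a "> <tag>" header line after "M  END", the value line(s), then a blank line.

```python
def sd_data_items(record):
    """Return {tag: value} for the data items in one SDF record.

    Hypothetical helper for quick checks only. Assumes each data item
    is a '>  <tag>' header line followed by value lines and a blank line.
    """
    lines = record.splitlines()
    # Data items come after the molblock, which ends with "M  END"
    i = lines.index("M  END") + 1 if "M  END" in lines else 0
    items = {}
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line:
            tag = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            value_lines = []
            # Collect value lines up to the blank line or record end
            while i < len(lines) and lines[i].strip() and lines[i] != "$$$$":
                value_lines.append(lines[i])
                i += 1
            items[tag] = "\n".join(value_lines)
        else:
            i += 1
    return items


record = (
    "0001\n OpenBabel\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n>  <logX>\n-2.52\n\n$$$$\n"
)
print(sd_data_items(record))
```

The "record" string here is a cut-down stand-in with no atoms, not real Open Babel output.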

You also had a "No." field. I don't know if that is the index of the input 
record, or the integer value of the name_code, like:

>>> int(name_code)
1

I'll assume it's the input index.

I'll use Python's built-in "enumerate()" function, which adds an index to each 
element of an iterator. For example, I can iterate through the characters of 
"ABCD" like this:

>>> for c in "ABCD":
...   print(c)
...
A
B
C
D

For each X in the input iterator, enumerate() yields the pair (index, X):

>>> for i, c in enumerate("ABCD"):
...   print(i, c)
...
0 A
1 B
2 C
3 D

I can also specify the initial index, for example, to start at 1:

>>> for i, c in enumerate("ABCD", 1):
...   print(i, c)
...
1 A
2 B
3 C
4 D
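To show why the two readings of "No." are not always the same, here is a small sketch with a made-up list of name codes that has a gap in it. The function name "number_records" and the sample codes are only for illustration.

```python
def number_records(name_codes):
    """Pair each record's input index (from 1) with int(name_code)."""
    return [(index, int(code)) for index, code in enumerate(name_codes, 1)]


# Made-up name codes; "0003" and "0004" are missing on purpose
for index, code_value in number_records(["0001", "0002", "0005"]):
    print(index, code_value)
```

This prints "1 1", "2 2", and "3 5". The two numbers agree only as long as the name codes are consecutive and start at 0001.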

The last bit to know is that pybel's "readfile" gives a way to iterate over all 
molecules in a file.

>>> from openbabel import pybel
>>> for i, mol in enumerate(pybel.readfile("smi", "wikipedia2.smi"), 1):
...   print("Entry#:", i, repr(mol.title))
...   if i == 10:
...     break
...
Entry#: 1 'Ammonia'
Entry#: 2 'Aspirin'
Entry#: 3 'Acetylene'
Entry#: 4 'Adenosine triphosphate'
Entry#: 5 'Ampicillin'
Entry#: 6 'Ascorbic acid'
Entry#: 7 'Ascorbic acid'
Entry#: 8 'Amphetamine'
Entry#: 9 'Aspartame'
Entry#: 10 'Amoxicillin'

Finally, all the coordinates were 0.0. To make things a bit nicer, use the 
"make2D()" or "make3D()" methods to add 2D or 3D coordinates, respectively. 
Your example uses 3D, so I'll do that.

Putting it all together, along with some use of Python's "argparse" module to 
handle command-line processing (which I won't discuss here), gives the 
"rjrich.py" program, attached.

It's used like this:

  % python rjrich.py test.smi

You can also change the output tag, and the output file name, like this:

  % python rjrich.py test.smi --tag Cacao2 -o cacao2.sdf

(I believe if you're using Open Babel under Windows, with Python installed, 
then you should use "py" instead of "python" to run the program.)
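For anyone curious how those command-line options turn into values inside the program, here is a short sketch. It declares the same three arguments as the attached program and passes argument lists directly to parse_args() instead of reading the real command line. The "build_parser" wrapper is just for this example.

```python
import argparse


def build_parser():
    """Build a parser with the same options as the attached program."""
    parser = argparse.ArgumentParser(
        description="convert SMILES with name and data value to SDF")
    parser.add_argument("--tag", default="logX",
                        help="SDF data tag to store the value")
    parser.add_argument("--output", "-o",
                        help="output SDF filename (default: stdout)")
    parser.add_argument("filename")
    return parser


parser = build_parser()

# Equivalent to: python rjrich.py test.smi
args = parser.parse_args(["test.smi"])
print(args.filename, args.tag, args.output)

# Equivalent to: python rjrich.py test.smi --tag Cacao2 -o cacao2.sdf
args2 = parser.parse_args(["test.smi", "--tag", "Cacao2", "-o", "cacao2.sdf"])
print(args2.filename, args2.tag, args2.output)
```

The first call prints "test.smi logX None" and the second prints "test.smi Cacao2 cacao2.sdf". An output of None is what tells the program to write to stdout.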

Cheers,

                                Andrew
                                da...@dalkescientific.com

import sys
import argparse
from openbabel import pybel

# Use the "argparse" module to handle command-line argument processing
parser = argparse.ArgumentParser(
    description = "convert SMILES with name and data value to SDF"
    )
parser.add_argument("--tag",
                        default = "logX",
                        help = "SDF data tag to store the value")
parser.add_argument("--output", "-o",
                        help = "output SDF filename (default: stdout)")
parser.add_argument("filename")

def main():
    args = parser.parse_args()
    
    # Open the SMILES file for reading
    mol_reader = pybel.readfile("smi", args.filename)

    # Figure out where to write the output
    if args.output is None:
        output_file = sys.stdout
    else:
        output_file = open(args.output, "w")

    # Process each molecule, numbering the records starting from 1
    for mol_no, mol in enumerate(mol_reader, 1):
        # The title looks like "0001    -2.52" with the name_code
        # followed by whitespace followed by the value
        name_code, value = mol.title.split()

        # Update the title to have just the name_code
        mol.title = name_code
        
        # Add new data items
        mol.data["No."] = mol_no
        mol.data[args.tag] = value

        # Generate 3D coordinates
        mol.make3D()
        
        # Write the result
        output_file.write(mol.write("sdf"))


# The standard Python way to recognize this is being run as a
# command-line program.
if __name__ == "__main__":
    main()