Re: [Rdkit-discuss] SDF properties in case of error

Andrew Dalke Fri, 01 May 2015 16:46:05 -0700

On May 1, 2015, at 12:01 AM, Michael Reutlinger wrote:
> However, in some cases this does not help. E.g. when an unknown atom (most of 
> the time this is X) is found in the MolBlock the import fails with an 
> Post-condition Violation and None is yielded. This is fine to detect the 
> problem BUT it is impossible to get any information about the molecule which 
> failed.


As a backup solution, outside of RDKit, you might try my chemfp package, 
available from

https://chem-fingerprints.googlecode.com/files/chemfp-1.1.tar.gz

(Hmm, looks like I need to migrate that away from Google Code.)

One of the internal functions [*] has a way to read individual SDF records as 
text:

  >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
  ...   print record.split("\n", 1)[0]
  ... 
  9425004
  9425009
  9425012
  9425015
  9425018
  9425021

If you use the bit of code in this email after my signature you can extract the 
tag/data pair from the record:

>>> from chemfp import sdf_reader
>>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
...   id = record.partition("\n")[0]
...   tags = dict(get_sdf_tag_pairs(record))
...   print id, tags["PUBCHEM_OPENEYE_ISO_SMILES"]
... 
9425004 CC1=CC(=NN1CC(=O)NNC(=O)\C=C\C2=C(C=CC=C2Cl)F)C
9425009 CC1=CC(=NN1CC(=O)NNC(=O)CCC2=NC(=NO2)C3=CC=CC=C3)C
9425012 CCC1=NOC(=C1C(=O)NNC(=O)CN2C(=CC(=N2)C)C)C
9425015 CC1=CC(=NN1CC(=O)NNC(=O)CCC(=O)C2=CC=C(C=C2)C3=CC=CC=C3)C
9425018 CC1=CC(=NN1CC(=O)NNC(=O)C2=CC=CC=C2SCC(=O)N(C)C)C
  ...

I also included a function called "MolFromSDBlock" which is like 
"MolFromMolBlock" except that it also copies over the tag data as properties. 
In that way you can get what you want from RDKit like this:


>>> for record in sdf_reader.open_sdf("/Users/dalke/databases/chembl_14.sdf"):
...   mol = MolFromSDBlock(record)
...   if mol is None:
...     print "Could not process", dict(get_sdf_tag_pairs(record))["chembl_id"]
...   else:
...     print mol.GetProp("chembl_id"), mol.GetNumAtoms()
... 
CHEMBL438581 165
CHEMBL155459 44
CHEMBL154288 52
CHEMBL443179 56
CHEMBL443183 92
CHEMBL443332 18
  ..
CHEMBL265763 40
[01:03:52] Explicit valence for atom # 0 B, 5, is greater than permitted
Could not process CHEMBL268118
CHEMBL265830 29
  ...

I've also sketched out a solution which returns an empty molecule with the 
"_Name", "_Error", and properties set from the SD tag. There's only one line to 
comment out to get it, but I've not actually tested that code path.


Be aware that I wasn't quite as experienced in how to parse SD files when I 
wrote code for chemfp-1.1 some 5 years ago. For example, you shouldn't have tag 
data with a line starting with a '>'.


[*] By "internal" I mean that it's not documented and not part of the stable 
API. In fact, it has changed in more recent versions of chemfp, where similar 
functionality is now part of the stable API. However, those more recent 
versions, while still free/open source software, are a commercial product and 
costs money.

Contact me if you are interesting in purchasing a copy. :)

> My question is if there is a way to get to the data even for those cases? The 
> files tend to be very big so accessing the molecule re-parsing it 
> line-by-line in python to get the name for a specific molecule number (found 
> by enumerating the supplier) is not really an option. 

My timing numbers for chemfp-1.1 had about the same performance as RDKit's own 
parser. In newer versions I fixed some of the corner cases, and rewrote the 
code in C for better performance.


> What would be a good solution in my opinion is to create an empty molecule 
> with all sd properties, including _Name, in case of an error instead of None. 
> The actual error could then also be communicated into python via an '_Error' 
> property.
  ...
> Maybe this behaviour could be activated via an option and the default would 
> be to return None, to not break any existing code. 

It would have to be via an option, for exactly the reason you highlighted.
The option might look like:

   ForwardSDMolSupplier(...., onError=handler)

The simplest is if "handler" is one of a handful of possible string values:

  - "None" to return None on failure; the current behavior
  - "ErrorMol" to return an error molecule like you describe

Personally, I would love some easy way from the Python API to get access to the 
warning and error messages without having to intercept the log messages. I 
think that something like this is the way to get there.

Cheers,


                                Andrew
                                [email protected]

##### extract all of the tag/value pairs from an SD record
import re

_tag_pattern = re.compile(r">.*<([^>]*)>.*\n")
_next_tag_pattern = re.compile(r"(\n)\n(>|\$\$\$\$\n)")

def get_sdf_tag_pairs(record):
    # Hack solution!
    start = record.find("M  END\n")
    if start == -1:
        return []
    
    start += 7
    if record[start:start+1] != ">":
        # At the '$$$$', or in the wrong format.
        return []
    
    terms = []
    while 1:
        m = _tag_pattern.match(record, start)
        if m is None:
            break
        tag = m.group(1)
        data_start = m.end()
        
        m = _next_tag_pattern.search(record, data_start)
        if m is None:
            break
        data_end = m.start(1)
        terms.append( (tag, record[data_start:data_end]) )
        
        start = m.start(2)
    
    return terms

from rdkit.Chem import MolFromMolBlock, Mol

def MolFromSDBlock(molBlock, sanitize=True, removeHs=True, strictParsing=True, 
includeTags=True):
    mol = MolFromMolBlock(molBlock, sanitize, removeHs, strictParsing)
    if mol is None:
        # Comment out the next line to implement Michael Reutlinger's
        # proposed variation
        return mol
        # Create an empty molecule with the right _Name and an _Error
        mol = Chem.Mol()
        mol.SetProp("_Name", molBlock.split("\n", 1)[0])
        mol.SetProp("_Error", "could not process")
    
    # MolFromMolBlock reader doesn't process the properties.
    # According to
    #   
http://www.mail-archive.com/[email protected]/msg01436.html
    # I can make a new SDMolSupplier, then SetData(), get the first
    # record, and get its property names. Not going to do that.
    # Instead, read the tag data myself.
    
    if includeTags:
        for k, v in get_sdf_tag_pairs(molBlock):
            if k[:1] != "_":
                mol.SetProp(k, v)
    
    return mol



#####


------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud 
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] SDF properties in case of error

Reply via email to