On May 1, 2015, at 12:01 AM, Michael Reutlinger wrote: > However, in some cases this does not help. E.g. when an unknown atom (most of > the time this is X) is found in the MolBlock the import fails with an > Post-condition Violation and None is yielded. This is fine to detect the > problem BUT it is impossible to get any information about the molecule which > failed.
As a backup solution, outside of RDKit, you might try my chemfp package, available from https://chem-fingerprints.googlecode.com/files/chemfp-1.1.tar.gz (Hmm, looks like I need to migrate that away from Google Code.) One of the internal functions [*] has a way to read individual SDF records as text: >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"): ... print record.split("\n", 1)[0] ... 9425004 9425009 9425012 9425015 9425018 9425021 If you use the bit of code in this email after my signature you can extract the tag/data pair from the record: >>> from chemfp import sdf_reader >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"): ... id = record.partition("\n")[0] ... tags = dict(get_sdf_tag_pairs(record)) ... print id, tags["PUBCHEM_OPENEYE_ISO_SMILES"] ... 9425004 CC1=CC(=NN1CC(=O)NNC(=O)\C=C\C2=C(C=CC=C2Cl)F)C 9425009 CC1=CC(=NN1CC(=O)NNC(=O)CCC2=NC(=NO2)C3=CC=CC=C3)C 9425012 CCC1=NOC(=C1C(=O)NNC(=O)CN2C(=CC(=N2)C)C)C 9425015 CC1=CC(=NN1CC(=O)NNC(=O)CCC(=O)C2=CC=C(C=C2)C3=CC=CC=C3)C 9425018 CC1=CC(=NN1CC(=O)NNC(=O)C2=CC=CC=C2SCC(=O)N(C)C)C ... I also included a function called "MolFromSDBlock" which is like "MolFromMolBlock" except that it also copies over the tag data as properties. In that way you can get what you want from RDKit like this: >>> for record in sdf_reader.open_sdf("/Users/dalke/databases/chembl_14.sdf"): ... mol = MolFromSDBlock(record) ... if mol is None: ... print "Could not process", dict(get_sdf_tag_pairs(record))["chembl_id"] ... else: ... print mol.GetProp("chembl_id"), mol.GetNumAtoms() ... CHEMBL438581 165 CHEMBL155459 44 CHEMBL154288 52 CHEMBL443179 56 CHEMBL443183 92 CHEMBL443332 18 .. CHEMBL265763 40 [01:03:52] Explicit valence for atom # 0 B, 5, is greater than permitted Could not process CHEMBL268118 CHEMBL265830 29 ... I've also sketched out a solution which returns an empty molecule with the "_Name", "_Error", and properties set from the SD tag. There's only one line to comment out to get it, but I've not actually tested that code path. Be aware that I wasn't quite as experienced in how to parse SD files when I wrote code for chemfp-1.1 some 5 years ago. For example, you shouldn't have tag data with a line starting with a '>'. [*] By "internal" I mean that it's not documented and not part of the stable API. In fact, it has changed in more recent versions of chemfp, where similar functionality is now part of the stable API. However, those more recent versions, while still free/open source software, are a commercial product and costs money. Contact me if you are interesting in purchasing a copy. :) > My question is if there is a way to get to the data even for those cases? The > files tend to be very big so accessing the molecule re-parsing it > line-by-line in python to get the name for a specific molecule number (found > by enumerating the supplier) is not really an option. My timing numbers for chemfp-1.1 had about the same performance as RDKit's own parser. In newer versions I fixed some of the corner cases, and rewrote the code in C for better performance. > What would be a good solution in my opinion is to create an empty molecule > with all sd properties, including _Name, in case of an error instead of None. > The actual error could then also be communicated into python via an '_Error' > property. ... > Maybe this behaviour could be activated via an option and the default would > be to return None, to not break any existing code. It would have to be via an option, for exactly the reason you highlighted. The option might look like: ForwardSDMolSupplier(...., onError=handler) The simplest is if "handler" is one of a handful of possible string values: - "None" to return None on failure; the current behavior - "ErrorMol" to return an error molecule like you describe Personally, I would love some easy way from the Python API to get access to the warning and error messages without having to intercept the log messages. I think that something like this is the way to get there. Cheers, Andrew [email protected] ##### extract all of the tag/value pairs from an SD record import re _tag_pattern = re.compile(r">.*<([^>]*)>.*\n") _next_tag_pattern = re.compile(r"(\n)\n(>|\$\$\$\$\n)") def get_sdf_tag_pairs(record): # Hack solution! start = record.find("M END\n") if start == -1: return [] start += 7 if record[start:start+1] != ">": # At the '$$$$', or in the wrong format. return [] terms = [] while 1: m = _tag_pattern.match(record, start) if m is None: break tag = m.group(1) data_start = m.end() m = _next_tag_pattern.search(record, data_start) if m is None: break data_end = m.start(1) terms.append( (tag, record[data_start:data_end]) ) start = m.start(2) return terms from rdkit.Chem import MolFromMolBlock, Mol def MolFromSDBlock(molBlock, sanitize=True, removeHs=True, strictParsing=True, includeTags=True): mol = MolFromMolBlock(molBlock, sanitize, removeHs, strictParsing) if mol is None: # Comment out the next line to implement Michael Reutlinger's # proposed variation return mol # Create an empty molecule with the right _Name and an _Error mol = Chem.Mol() mol.SetProp("_Name", molBlock.split("\n", 1)[0]) mol.SetProp("_Error", "could not process") # MolFromMolBlock reader doesn't process the properties. # According to # http://www.mail-archive.com/[email protected]/msg01436.html # I can make a new SDMolSupplier, then SetData(), get the first # record, and get its property names. Not going to do that. # Instead, read the tag data myself. if includeTags: for k, v in get_sdf_tag_pairs(molBlock): if k[:1] != "_": mol.SetProp(k, v) return mol ##### ------------------------------------------------------------------------------ One dashboard for servers and applications across Physical-Virtual-Cloud Widest out-of-the-box monitoring support with 50+ applications Performance metrics, stats and reports that give you Actionable Insights Deep dive visibility with transaction tracing using APM Insight. http://ad.doubleclick.net/ddm/clk/290420510;117567292;y _______________________________________________ Rdkit-discuss mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

