Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
On Wed, May 23, 2012 at 2:24 PM, JP jeanpaul.ebe...@inhibox.com wrote: This molecule with no atoms being valid is a questionable design decision (not your fault of course, you are just implementing a spec). I think that the smiles writer should not write an empty molecule (you could change the method signature to take yet another param empties=True/False but I do not think this is correct either). And IMHO the parser should not read one either. The SMILES parser doesn't What happens with the following very organic smiles file? C JP1 CCC JP2 JP4 The generator is going to give me an empty (third) molecule? So I have to always dirty my code with m.GetNumAtoms() 0 in that loop. What is the empty molecule is at the end of the file (ouch)? Also what if you have an identifier for the empty molecule. So replace the empty third line with JP3 ? The SMILES parser currently does not accept empty strings. It generates an error message and returns None. I don't have any plans to change this. What is the ForwardSDMolSupplier/SDWriter going to do with this empty mol? Does it just write name and properties? The name, the empty mol block, and the properties: In [2]: m = Chem.Mol() In [4]: import StringIO In [5]: sio = StringIO.StringIO() In [6]: m.SetProp('prop1','foo') In [7]: m.SetProp('prop2','bar') In [8]: w = Chem.SDWriter(sio) In [9]: w.write(m) In [10]: w.close() In [11]: print sio.getvalue() RDKit 0 0 0 0 0 0 0 0 0 0999 V2000 M END prop1 (1) foo prop2 (1) bar The SD readers can also work with this: In [22]: rdr = Chem.SDMolSupplier() In [23]: rdr.SetData(sio.getvalue()) In [24]: m2 = rdr.next() In [25]: m2.GetNumAtoms() Out[25]: 0 In [26]: m2.GetProp('prop1') Out[26]: 'foo' I don't want to be controversial or anything, but I disagree with almost everyone else about this, in that we should use common sense and not stick to the spec in this case. Does someone have a use-case for an empty molecule ? At least I can understand what people are using this for Having an empty molecule can be useful in cases where you want to include information about the molecule (i..e using the property fields in an SDF) but either do not know what the actual structure is, do not want to disclose what the structure is, or cannot represent the structure in a reasonable manner in an SDF (e.g. solids, organometalllics, etc.). Also having the writer do one thing, and the parser do another means that RDKit cannot read the files/molecules it generates. I think this is a big inconsistency, and not one deserving of this excellent bit of software. I also don't like the inconsistency, but I don't see how it's really avoidable: empty molecules need to be supported, so there are limited choices available. -greg -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
Hi, One potential improvement that may help is to use Python's warning package. If RDKit generates warnings using this package, then user can control whether they want the empty SMILES to return None silently or raise an exception. Eddie On May 24, 2012, at 11:06 PM, Greg Landrum wrote: On Wed, May 23, 2012 at 2:24 PM, JP jeanpaul.ebe...@inhibox.com wrote: This molecule with no atoms being valid is a questionable design decision (not your fault of course, you are just implementing a spec). I think that the smiles writer should not write an empty molecule (you could change the method signature to take yet another param empties=True/False but I do not think this is correct either). And IMHO the parser should not read one either. The SMILES parser doesn't What happens with the following very organic smiles file? C JP1 CCC JP2 JP4 The generator is going to give me an empty (third) molecule? So I have to always dirty my code with m.GetNumAtoms() 0 in that loop. What is the empty molecule is at the end of the file (ouch)? Also what if you have an identifier for the empty molecule. So replace the empty third line with JP3 ? The SMILES parser currently does not accept empty strings. It generates an error message and returns None. I don't have any plans to change this. What is the ForwardSDMolSupplier/SDWriter going to do with this empty mol? Does it just write name and properties? The name, the empty mol block, and the properties: In [2]: m = Chem.Mol() In [4]: import StringIO In [5]: sio = StringIO.StringIO() In [6]: m.SetProp('prop1','foo') In [7]: m.SetProp('prop2','bar') In [8]: w = Chem.SDWriter(sio) In [9]: w.write(m) In [10]: w.close() In [11]: print sio.getvalue() RDKit 0 0 0 0 0 0 0 0 0 0999 V2000 M END prop1 (1) foo prop2 (1) bar The SD readers can also work with this: In [22]: rdr = Chem.SDMolSupplier() In [23]: rdr.SetData(sio.getvalue()) In [24]: m2 = rdr.next() In [25]: m2.GetNumAtoms() Out[25]: 0 In [26]: m2.GetProp('prop1') Out[26]: 'foo' I don't want to be controversial or anything, but I disagree with almost everyone else about this, in that we should use common sense and not stick to the spec in this case. Does someone have a use-case for an empty molecule ? At least I can understand what people are using this for Having an empty molecule can be useful in cases where you want to include information about the molecule (i..e using the property fields in an SDF) but either do not know what the actual structure is, do not want to disclose what the structure is, or cannot represent the structure in a reasonable manner in an SDF (e.g. solids, organometalllics, etc.). Also having the writer do one thing, and the parser do another means that RDKit cannot read the files/molecules it generates. I think this is a big inconsistency, and not one deserving of this excellent bit of software. I also don't like the inconsistency, but I don't see how it's really avoidable: empty molecules need to be supported, so there are limited choices available. -greg -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
On 23/05/12 13:24, JP wrote: The generator is going to give me an empty (third) molecule? So I have to always dirty my code with m.GetNumAtoms() 0 in that loop. What is the empty molecule is at the end of the file (ouch)? I'm curious about this point in particular. I don't know the details of your code of course, but it seems to me that it is more pythonic to ask forgiveness than permission. So replace GetNumAtoms() with try/except and catch the exception of bad-stuff-happened - presumably when you do a clean_mol[i] or some such... Paul. -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
Dear all, The svn version of the RDKit now behaves like this: In [2]: m = Chem.MolFromMolFile('empty.mol') In [3]: m.GetNumAtoms() Out[3]: 0 Notice that there are no longer error messages. There is, however, the following wart: In [6]: Chem.MolToSmiles(m) Out[6]: '' In [7]: m2 = Chem.MolFromSmiles('') SMILES Parse Error: syntax error for input: In [8]: m2 is None Out[8]: True The SMILES writer happily generates an empty string for the molecule with no atoms, but the SMILES parser generates an error. The behavior of the parser is, I believe, consistent with Daylight. The question is what the writer should do. I see two choices: 1) As is: Writer generates an empty string, but the parser generates an error 2) Change MolToSmiles so that it generates an error if the molecule has no atoms. I prefer the status quo (choice 1) because I don't really like the idea that a valid molecule would lead to an error in the writer. -greg -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
On 2012-04-30 13:41, Paul Emsley wrote: On 30/04/12 01:03, Eddie Cao wrote: Hi Andrew, I also prefer #2. #1 is not quite sensible because many readers like MolFromSmiles will return None on failure and it will be hard to distinguish bad input from an empty one if we choose to do #1. Semantically, in many RDKit use cases, None and Empty Mol are as different as a webpage not found (HTTP 404) and a blank web page. FWIW, I agree to #2 too. Paul. My humble opinion would also be that #2 is the best option. In line with the analogy given by Eddie, treating an empty molecule as an error is equivalent to saying that a blank string is equal to NULL - which would be unfortunate, since there is a big semantic difference. Also, imagine that you delete all atoms and bonds in a molecule in order to replace them. Enforcing the semantics of empty molecules are invalid should then cause your code to break half-way through the process ? Cheers -- Jan Holst Jensen -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
Hi Andrew, I also prefer #2. #1 is not quite sensible because many readers like MolFromSmiles will return None on failure and it will be hard to distinguish bad input from an empty one if we choose to do #1. Semantically, in many RDKit use cases, None and Empty Mol are as different as a webpage not found (HTTP 404) and a blank web page. Eddie On Apr 28, 2012, at 3:15 PM, Andrew Dalke wrote: It looks like this discussion is still unresolved. Greg's going to have a pile of things on his desk (eDesk?) when he gets back from holidays. :) On Jan 30, 2012, at 1:26 PM, JP wrote: But then I will have to add the if not clean_mol.GetNumAtoms(): before/after replacing/editing molecule parts, after reading molecules, before writing them etc. i.e. I'd need this statement in a lot of places. This is why I asked if it should be considered a valid molecule - because if these moves in SanitizeMol I wouldn't need any of that e.g. I can assume that the molecule I have in hand, is valid and if I still wanted these molecules (for some not so clear reason) I could just switch of sanitization off, on the methods that allow it. I just ran into this myself. I used SaltRemover on CHEMBL1644029, which is potassium nitrate. The SaltRemover removed both ions, leaving me with an empty structure. I blithely wrote the de-salted structures to a file. It wasn't until I had RDKit read the structures later, when trying to figure out the message Problems encountered parsing data fields, that I realized that I had de-salted a salt, leaving nothing. from rdkit import Chem from rdkit.Chem import SaltRemover remover = SaltRemover.SaltRemover() mol = Chem.MolFromSmiles([K+].[O-]N(=O)=O) mol2 = remover.StripMol(mol) Chem.SDWriter(/dev/stdout).write(mol2) RDKit 0 0 0 0 0 0 0 0 0 0999 V2000 M END writer = Chem.SDWriter(empty.sdf) writer.write(mol2) writer.close() for mol in Chem.ForwardSDMolSupplier(empty.sdf): ... print I have, repr(mol) ... [00:02:16] ERROR: on line 5 Problems encountered parsing data fields [00:02:16] ERROR: moving to the begining of the next molecule I have rdkit.Chem.rdchem.Mol object at 0x101e00b40 I agree with JP. Either: 1) this is an ERROR, in which case a) the reader should yield a None rather than a molecule object, and b) the writer should refused to write it, or 2) this is not an ERROR, and the reader should be fixed so it doesn't log an error message for this case. Since it's either valid syntax or it isn't, it probably shouldn't generate any message. Of these, I prefer #2. Cheers, Andrew da...@dalkescientific.com -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
It looks like this discussion is still unresolved. Greg's going to have a pile of things on his desk (eDesk?) when he gets back from holidays. :) On Jan 30, 2012, at 1:26 PM, JP wrote: But then I will have to add the if not clean_mol.GetNumAtoms(): before/after replacing/editing molecule parts, after reading molecules, before writing them etc. i.e. I'd need this statement in a lot of places. This is why I asked if it should be considered a valid molecule - because if these moves in SanitizeMol I wouldn't need any of that e.g. I can assume that the molecule I have in hand, is valid and if I still wanted these molecules (for some not so clear reason) I could just switch of sanitization off, on the methods that allow it. I just ran into this myself. I used SaltRemover on CHEMBL1644029, which is potassium nitrate. The SaltRemover removed both ions, leaving me with an empty structure. I blithely wrote the de-salted structures to a file. It wasn't until I had RDKit read the structures later, when trying to figure out the message Problems encountered parsing data fields, that I realized that I had de-salted a salt, leaving nothing. from rdkit import Chem from rdkit.Chem import SaltRemover remover = SaltRemover.SaltRemover() mol = Chem.MolFromSmiles([K+].[O-]N(=O)=O) mol2 = remover.StripMol(mol) Chem.SDWriter(/dev/stdout).write(mol2) RDKit 0 0 0 0 0 0 0 0 0 0999 V2000 M END writer = Chem.SDWriter(empty.sdf) writer.write(mol2) writer.close() for mol in Chem.ForwardSDMolSupplier(empty.sdf): ... print I have, repr(mol) ... [00:02:16] ERROR: on line 5 Problems encountered parsing data fields [00:02:16] ERROR: moving to the begining of the next molecule I have rdkit.Chem.rdchem.Mol object at 0x101e00b40 I agree with JP. Either: 1) this is an ERROR, in which case a) the reader should yield a None rather than a molecule object, and b) the writer should refused to write it, or 2) this is not an ERROR, and the reader should be fixed so it doesn't log an error message for this case. Since it's either valid syntax or it isn't, it probably shouldn't generate any message. Of these, I prefer #2. Cheers, Andrew da...@dalkescientific.com -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
Thanks for the explanation and the helpful-as-ever code snippets Greg. A molecule without atoms sounds a bit weird to me but it seems to be a perfectly legal (in CTFile definition) to have a no-atom molecule. Each record can hold one molecule (which may be blank). from this very inspiring article Why not to use SDF: http://molmatinf.com/whynotmolsdf.html So a case for it exists. On the practical side, in terms of the API some things to think about: - Would you define the checks to exclude, or the ones to include in SanitzeMol? - Shouldn't the default behaviour be to run all checks (or?) ? - How would sanitizemol be called from other methods such a SDMolSupplier etc. (the methods which allow for optional sanitization)? Are you going to allow for flags to be passed there too? Isn't this going to make the API unwieldy? - Will it the flags be in the form of ints which you OR together CHECK_EMPTY_MOL | CHECK_KEKULIZE (like Java/C++ kind of options) ? PS the salt removal code is working fine as is. I'd rather remove both [Li].[Br] and check for the empty molecule as Nik suggested rather than keep one of those little buggers. Have a nice evening! - Jean-Paul Ebejer Early Stage Researcher On 30 January 2012 19:20, Greg Landrum greg.land...@gmail.com wrote: On Mon, Jan 30, 2012 at 1:26 PM, JP jeanpaul.ebe...@inhibox.com wrote: But then I will have to add the if not clean_mol.GetNumAtoms(): before/after replacing/editing molecule parts, after reading molecules, before writing them etc. i.e. I'd need this statement in a lot of places. This is why I asked if it should be considered a valid molecule - because if these moves in SanitizeMol I wouldn't need any of that e.g. I can assume that the molecule I have in hand, is valid and if I still wanted these molecules (for some not so clear reason) I could just switch of sanitization off, on the methods that allow it. Just asking, there is probably some good design decision for this which I am missing... (hence the question) It's not an easy one. I believe there's not really a strong argument for either behavior. As you've seen, the current behavior of the RDKit is to treat molecules without atoms as completely legal entities. You can test for this case the way Nik pointed out. I'm playing with the idea of making the SanitizeMol routine configurable, so you could pass in a set of flags to control which operations are carried out. If this happens, a check for zero atoms flag that defaults to false could be added. I just created a feature request for this: https://sourceforge.net/tracker/?func=detailaid=3481729group_id=160139atid=814653 In the meantime, if you'd like to change the definition of sanitization, the easiest way to do so would be to write your own function, perhaps something like this (not tested): def mySanitize(mol): if not mol.GetNumAtoms(): raise ValueError,'molecule has no atoms' Chem.SanitizeMol(mol) Note: for the particular case of salt stripping, you can ensure that the salt stripper doesn't remove all atoms using the dontRemoveEverything optional argument. Take a look at the help for SaltRemover.StripMol: http://rdkit.org/docs/api/rdkit.Chem.SaltRemover.SaltRemover-class.html#StripMol -greg -- Try before you buy = See our experts in action! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-dev2___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?
On Jan 30, 2012, at 9:24 PM, JP wrote: A molecule without atoms sounds a bit weird to me but it seems to be a perfectly legal (in CTFile definition) to have a no-atom molecule. Some SMILES files readers will break if there are no atoms. For example, the Daylight toolkit SMILES parser gives an error if the SMILES is the empty string. The OpenSMILES spec follows this tradition and does not allow the empty string; although that would be trivial to add. Andrew da...@dalkescientific.com -- Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss