Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-05-25 Thread Greg Landrum
On Wed, May 23, 2012 at 2:24 PM, JP jeanpaul.ebe...@inhibox.com wrote:
 This molecule with no atoms being valid is a questionable design decision
 (not your fault of course, you are just implementing a spec).

 I think that the smiles writer should not write an empty molecule (you could
 change the method signature to take yet another param empties=True/False
 but I do not think this is correct either).  And IMHO the parser should not
 read one either.

The SMILES parser doesn't

 What happens with the following very organic smiles file?

 C JP1
 CCC JP2

  JP4

 The generator is going to give me an empty (third) molecule?  So I have to
 always dirty my code with m.GetNumAtoms()  0 in that loop.  What is the
 empty molecule is at the end of the file (ouch)?
 Also what if you have an identifier for the empty molecule.  So replace the
 empty third line with  JP3 ?

The SMILES parser currently does not accept empty strings. It
generates an error message and returns None. I don't have any plans
to change this.


 What is the ForwardSDMolSupplier/SDWriter going to do with this empty mol?
  Does it just write name and properties?

The name, the empty mol block, and the properties:
In [2]: m = Chem.Mol()
In [4]: import StringIO
In [5]: sio = StringIO.StringIO()
In [6]: m.SetProp('prop1','foo')
In [7]: m.SetProp('prop2','bar')
In [8]: w = Chem.SDWriter(sio)
In [9]: w.write(m)
In [10]: w.close()
In [11]: print sio.getvalue()

 RDKit

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
  prop1  (1)
foo

  prop2  (1)
bar




The SD readers can also work with this:
In [22]: rdr = Chem.SDMolSupplier()
In [23]: rdr.SetData(sio.getvalue())
In [24]: m2 = rdr.next()
In [25]: m2.GetNumAtoms()
Out[25]: 0

In [26]: m2.GetProp('prop1')
Out[26]: 'foo'


 I don't want to be controversial or anything, but I disagree with almost
 everyone else about this, in that we should use common sense and not stick
 to the spec in this case.
 Does someone have a use-case for an empty molecule ?  At least I can
 understand what people are using this for

Having an empty molecule can be useful in cases where you want to
include information about the molecule (i..e using the property fields
in an SDF) but either do not know what the actual structure is, do not
want to disclose what the structure is, or cannot represent the
structure in a reasonable manner in an SDF (e.g. solids,
organometalllics, etc.).

 Also having the writer do one thing, and the parser do another means that
 RDKit cannot read the files/molecules it generates.  I think this is a big
 inconsistency, and not one deserving of this excellent bit of software.

I also don't like the inconsistency, but I don't see how it's really
avoidable: empty molecules need to be supported, so there are limited
choices available.

-greg

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-05-25 Thread Eddie Cao
Hi,

One potential improvement that may help is to use Python's warning package. If 
RDKit generates warnings using this package, then user can control whether they 
want the empty SMILES to return None silently or raise an exception.

Eddie


On May 24, 2012, at 11:06 PM, Greg Landrum wrote:

 On Wed, May 23, 2012 at 2:24 PM, JP jeanpaul.ebe...@inhibox.com wrote:
 This molecule with no atoms being valid is a questionable design decision
 (not your fault of course, you are just implementing a spec).
 
 I think that the smiles writer should not write an empty molecule (you could
 change the method signature to take yet another param empties=True/False
 but I do not think this is correct either).  And IMHO the parser should not
 read one either.
 
 The SMILES parser doesn't
 
 What happens with the following very organic smiles file?
 
 C JP1
 CCC JP2
 
  JP4
 
 The generator is going to give me an empty (third) molecule?  So I have to
 always dirty my code with m.GetNumAtoms()  0 in that loop.  What is the
 empty molecule is at the end of the file (ouch)?
 Also what if you have an identifier for the empty molecule.  So replace the
 empty third line with  JP3 ?
 
 The SMILES parser currently does not accept empty strings. It
 generates an error message and returns None. I don't have any plans
 to change this.
 
 
 What is the ForwardSDMolSupplier/SDWriter going to do with this empty mol?
  Does it just write name and properties?
 
 The name, the empty mol block, and the properties:
 In [2]: m = Chem.Mol()
 In [4]: import StringIO
 In [5]: sio = StringIO.StringIO()
 In [6]: m.SetProp('prop1','foo')
 In [7]: m.SetProp('prop2','bar')
 In [8]: w = Chem.SDWriter(sio)
 In [9]: w.write(m)
 In [10]: w.close()
 In [11]: print sio.getvalue()
 
 RDKit
 
  0  0  0  0  0  0  0  0  0  0999 V2000
 M  END
 prop1  (1)
 foo
 
 prop2  (1)
 bar
 
 
 
 
 The SD readers can also work with this:
 In [22]: rdr = Chem.SDMolSupplier()
 In [23]: rdr.SetData(sio.getvalue())
 In [24]: m2 = rdr.next()
 In [25]: m2.GetNumAtoms()
 Out[25]: 0
 
 In [26]: m2.GetProp('prop1')
 Out[26]: 'foo'
 
 
 I don't want to be controversial or anything, but I disagree with almost
 everyone else about this, in that we should use common sense and not stick
 to the spec in this case.
 Does someone have a use-case for an empty molecule ?  At least I can
 understand what people are using this for
 
 Having an empty molecule can be useful in cases where you want to
 include information about the molecule (i..e using the property fields
 in an SDF) but either do not know what the actual structure is, do not
 want to disclose what the structure is, or cannot represent the
 structure in a reasonable manner in an SDF (e.g. solids,
 organometalllics, etc.).
 
 Also having the writer do one thing, and the parser do another means that
 RDKit cannot read the files/molecules it generates.  I think this is a big
 inconsistency, and not one deserving of this excellent bit of software.
 
 I also don't like the inconsistency, but I don't see how it's really
 avoidable: empty molecules need to be supported, so there are limited
 choices available.
 
 -greg
 
 --
 Live Security Virtual Conference
 Exclusive live event will cover all the ways today's security and 
 threat landscape has changed and how IT managers can respond. Discussions 
 will include endpoint security, mobile security and the latest in malware 
 threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-05-25 Thread Paul Emsley
On 23/05/12 13:24, JP wrote:

 The generator is going to give me an empty (third) molecule?  So I 
 have to always dirty my code with m.GetNumAtoms()  0 in that loop. 
  What is the empty molecule is at the end of the file (ouch)?


I'm curious about this point in particular. I don't know the details of 
your code of course, but it seems to me that it is more pythonic to ask 
forgiveness than permission.  So replace GetNumAtoms() with try/except 
and catch the exception of bad-stuff-happened - presumably when you do a 
clean_mol[i] or some such...

Paul.




--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-05-23 Thread Greg Landrum
Dear all,

The svn version of the RDKit now behaves like this:
In [2]: m = Chem.MolFromMolFile('empty.mol')

In [3]: m.GetNumAtoms()
Out[3]: 0

Notice that there are no longer error messages.

There is, however, the following wart:
In [6]: Chem.MolToSmiles(m)
Out[6]: ''

In [7]: m2 = Chem.MolFromSmiles('')
SMILES Parse Error: syntax error for input:

In [8]: m2 is None
Out[8]: True

The SMILES writer happily generates an empty string for the molecule
with no atoms, but the SMILES parser generates an error.

The behavior of the parser is, I believe, consistent with Daylight.
The question is what the writer should do. I see two choices:
1) As is: Writer generates an empty string, but the parser generates an error
2) Change MolToSmiles so that it generates an error if the molecule
has no atoms.

I prefer the status quo (choice 1) because I don't really like the
idea that a valid molecule would lead to an error in the writer.

-greg

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-04-30 Thread Jan Holst Jensen
On 2012-04-30 13:41, Paul Emsley wrote:
 On 30/04/12 01:03, Eddie Cao wrote:
 Hi Andrew,

 I also prefer #2. #1 is not quite sensible because many readers like 
 MolFromSmiles will return None on failure and it will be hard to distinguish 
 bad input from an empty one if we choose to do #1. Semantically, in many 
 RDKit use cases, None and Empty Mol are as different as a webpage not found 
 (HTTP 404) and a blank web page.
 FWIW, I agree to #2 too.

 Paul.

My humble opinion would also be that #2 is the best option. In line with 
the analogy given by Eddie, treating an empty molecule as an error is 
equivalent to saying that a blank string is equal to NULL - which would 
be unfortunate, since there is a big semantic difference.

Also, imagine that you delete all atoms and bonds in a molecule in order 
to replace them. Enforcing the semantics of empty molecules are 
invalid should then cause your code to break half-way through the process ?

Cheers
-- Jan Holst Jensen

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-04-29 Thread Eddie Cao
Hi Andrew,

I also prefer #2. #1 is not quite sensible because many readers like 
MolFromSmiles will return None on failure and it will be hard to distinguish 
bad input from an empty one if we choose to do #1. Semantically, in many RDKit 
use cases, None and Empty Mol are as different as a webpage not found (HTTP 
404) and a blank web page.

Eddie

On Apr 28, 2012, at 3:15 PM, Andrew Dalke wrote:

 It looks like this discussion is still unresolved. Greg's going to have a 
 pile of things on his desk (eDesk?) when he gets back from holidays. :)
 
 On Jan 30, 2012, at 1:26 PM, JP wrote:
 But then I will have to add the if not clean_mol.GetNumAtoms(): 
 before/after replacing/editing molecule parts, after reading molecules, 
 before writing them etc. i.e. I'd need this statement in a lot of places.  
 This is why I asked if it should be considered a valid molecule - because if 
 these moves in SanitizeMol I wouldn't need any of that e.g. I can assume 
 that the molecule I have in hand, is valid and if I still wanted these 
 molecules (for some not so clear reason) I could just switch of sanitization 
 off, on the methods that allow it.
 
 
 I just ran into this myself. I used SaltRemover on CHEMBL1644029, which is 
 potassium nitrate. The SaltRemover removed both ions, leaving me with an 
 empty structure.
 
 I blithely wrote the de-salted structures to a file. It wasn't until I had 
 RDKit read the structures later, when trying to figure out the message 
 Problems encountered parsing data fields, that I realized that I had 
 de-salted a salt, leaving nothing.
 
 
 
 
 from rdkit import Chem
 from rdkit.Chem import SaltRemover
 remover = SaltRemover.SaltRemover()
 mol = Chem.MolFromSmiles([K+].[O-]N(=O)=O)
 mol2 = remover.StripMol(mol)
 Chem.SDWriter(/dev/stdout).write(mol2)
 
RDKit  
 
 0  0  0  0  0  0  0  0  0  0999 V2000
 M  END
 
 writer = Chem.SDWriter(empty.sdf)
 writer.write(mol2)
 writer.close()
 for mol in Chem.ForwardSDMolSupplier(empty.sdf):
 ...   print I have, repr(mol)
 ... 
 [00:02:16] ERROR: on line 5 Problems encountered parsing data fields
 [00:02:16] ERROR: moving to the begining of the next molecule
 I have rdkit.Chem.rdchem.Mol object at 0x101e00b40
 
 
 I agree with JP. Either:
 1) this is an ERROR, in which case a) the reader should yield a None rather 
 than a molecule object, and b) the writer should refused to write it, or
 
 2) this is not an ERROR, and the reader should be fixed so it doesn't log an 
 error message for this case. Since it's either valid syntax or it isn't, it 
 probably shouldn't generate any message.
 
 Of these, I prefer #2.
 
 Cheers,
 
 
   Andrew
   da...@dalkescientific.com
 
 
 
 --
 Live Security Virtual Conference
 Exclusive live event will cover all the ways today's security and 
 threat landscape has changed and how IT managers can respond. Discussions 
 will include endpoint security, mobile security and the latest in malware 
 threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-04-28 Thread Andrew Dalke
It looks like this discussion is still unresolved. Greg's going to have a pile 
of things on his desk (eDesk?) when he gets back from holidays. :)

On Jan 30, 2012, at 1:26 PM, JP wrote:
 But then I will have to add the if not clean_mol.GetNumAtoms(): 
 before/after replacing/editing molecule parts, after reading molecules, 
 before writing them etc. i.e. I'd need this statement in a lot of places.  
 This is why I asked if it should be considered a valid molecule - because if 
 these moves in SanitizeMol I wouldn't need any of that e.g. I can assume that 
 the molecule I have in hand, is valid and if I still wanted these molecules 
 (for some not so clear reason) I could just switch of sanitization off, on 
 the methods that allow it.


I just ran into this myself. I used SaltRemover on CHEMBL1644029, which is 
potassium nitrate. The SaltRemover removed both ions, leaving me with an empty 
structure.

I blithely wrote the de-salted structures to a file. It wasn't until I had 
RDKit read the structures later, when trying to figure out the message 
Problems encountered parsing data fields, that I realized that I had 
de-salted a salt, leaving nothing.




 from rdkit import Chem
 from rdkit.Chem import SaltRemover
 remover = SaltRemover.SaltRemover()
 mol = Chem.MolFromSmiles([K+].[O-]N(=O)=O)
 mol2 = remover.StripMol(mol)
 Chem.SDWriter(/dev/stdout).write(mol2)

RDKit  

 0  0  0  0  0  0  0  0  0  0999 V2000
M  END

 writer = Chem.SDWriter(empty.sdf)
 writer.write(mol2)
 writer.close()
 for mol in Chem.ForwardSDMolSupplier(empty.sdf):
...   print I have, repr(mol)
... 
[00:02:16] ERROR: on line 5 Problems encountered parsing data fields
[00:02:16] ERROR: moving to the begining of the next molecule
I have rdkit.Chem.rdchem.Mol object at 0x101e00b40
 

I agree with JP. Either:
 1) this is an ERROR, in which case a) the reader should yield a None rather 
than a molecule object, and b) the writer should refused to write it, or

 2) this is not an ERROR, and the reader should be fixed so it doesn't log an 
error message for this case. Since it's either valid syntax or it isn't, it 
probably shouldn't generate any message.

Of these, I prefer #2.

Cheers,


Andrew
da...@dalkescientific.com



--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-01-30 Thread JP
Thanks for the explanation and the helpful-as-ever code snippets Greg.

A molecule without atoms sounds a bit weird to me but it seems to be a
perfectly legal (in CTFile definition) to have a no-atom molecule.
Each record can hold one molecule (which may be blank). from this very
inspiring article Why not to use SDF:
http://molmatinf.com/whynotmolsdf.html

So a case for it exists.

On the practical side, in terms of the API some things to think about:

- Would you define the checks to exclude, or the ones to include in
SanitzeMol?
- Shouldn't the default behaviour be to run all checks (or?) ?
- How would sanitizemol be called from other methods such a SDMolSupplier
etc. (the methods which allow for optional sanitization)?  Are you going to
allow for flags to be passed there too?  Isn't this going to make the API
unwieldy?
- Will it the flags be in the form of ints which you OR together
CHECK_EMPTY_MOL | CHECK_KEKULIZE (like Java/C++ kind of options) ?

PS the salt removal code is working fine as is.  I'd rather remove both
[Li].[Br] and check for the empty molecule as Nik suggested rather than
keep one of those little buggers.

Have a nice evening!


-
Jean-Paul Ebejer
Early Stage Researcher


On 30 January 2012 19:20, Greg Landrum greg.land...@gmail.com wrote:

 On Mon, Jan 30, 2012 at 1:26 PM, JP jeanpaul.ebe...@inhibox.com wrote:
 
  But then I will have to add the if not clean_mol.GetNumAtoms():
  before/after replacing/editing molecule parts, after reading molecules,
  before writing them etc. i.e. I'd need this statement in a lot of places.
  This is why I asked if it should be considered a valid molecule -
 because if
  these moves in SanitizeMol I wouldn't need any of that e.g. I can assume
  that the molecule I have in hand, is valid and if I still wanted these
  molecules (for some not so clear reason) I could just switch of
 sanitization
  off, on the methods that allow it.
 
  Just asking, there is probably some good design decision for this which
 I am
  missing... (hence the question)

 It's not an easy one. I believe there's not really a strong argument
 for either behavior. As you've seen, the current behavior of the RDKit
 is to treat molecules without atoms as completely legal entities. You
 can test for this case the way Nik pointed out.

 I'm playing with the idea of making the SanitizeMol routine
 configurable, so you could pass in a set of flags to control which
 operations are carried out. If this happens, a check for zero atoms
 flag that defaults to false could be added. I just created a feature
 request for this:

 https://sourceforge.net/tracker/?func=detailaid=3481729group_id=160139atid=814653

 In the meantime, if you'd like to change the definition of
 sanitization, the easiest way to do so would be to write your own
 function, perhaps something like this (not tested):

 def mySanitize(mol):
  if not mol.GetNumAtoms():
raise ValueError,'molecule has no atoms'
  Chem.SanitizeMol(mol)


 Note: for the particular case of salt stripping, you can ensure that
 the salt stripper doesn't remove all atoms using the
 dontRemoveEverything optional argument. Take a look at the help for
 SaltRemover.StripMol:

 http://rdkit.org/docs/api/rdkit.Chem.SaltRemover.SaltRemover-class.html#StripMol

 -greg

--
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Molecule with no atoms, so is it valid?

2012-01-30 Thread Andrew Dalke
On Jan 30, 2012, at 9:24 PM, JP wrote:
 A molecule without atoms sounds a bit weird to me but it seems to be a 
 perfectly legal (in CTFile definition) to have a no-atom molecule. 

Some SMILES files readers will break if there are no atoms. For example, the 
Daylight toolkit SMILES parser gives an error if the SMILES is the empty 
string. The OpenSMILES spec follows this tradition and does not allow the empty 
string; although that would be trivial to add.


Andrew
da...@dalkescientific.com



--
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss