Hi all,

  I spent the last couple of weeks working on a project related to molecular 
property and model calculations. It's called 'propbox', and is available from 
https://bitbucket.org/dalke/propbox . 

There are two parts to it:
  - a (sparse) table, where the rows are structures and the columns are 
properties
  - a resolver, which knows how to fill in columns.

When the resolver fills in a column, it can ask the table for other columns. 
This in turn may trigger more resolvers, until eventually the system ends up 
with the input structure, and can fill in everything else.

Another way to think of it is that each resolver is a set of Pipeline Pilot/KNIME 
nodes, with inputs and outputs connected by name. But unlike those data flow 
systems, which feed data forwards, propbox computes values on-demand.
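
To make the on-demand idea concrete, here is a toy sketch in plain Python. It is 
not propbox's code or API: ToyTable, the three resolver functions, and the 
letter-counting "heavy atom" placeholder are all made up for illustration. Each 
resolver names the columns it needs as its parameters, and nothing is computed 
until a column is asked for.

```python
import inspect


class ToyTable:
    """One row of a sparse table; columns are filled in only when requested."""

    def __init__(self, resolvers, **known):
        self.resolvers = resolvers  # column name -> function that computes it
        self.values = dict(known)   # columns that already have a value
        self.calls = []             # order in which resolvers actually ran

    def get(self, name):
        if name not in self.values:
            func = self.resolvers[name]
            # The function's parameter names are the columns it depends on.
            needed = inspect.signature(func).parameters
            args = [self.get(column) for column in needed]
            self.calls.append(name)
            self.values[name] = func(*args)
        return self.values[name]


def calc_heavy_atoms(smiles):
    # Placeholder only: counts letters, which is NOT a real heavy-atom count.
    return sum(ch.isalpha() for ch in smiles)


def calc_size_class(heavy_atoms):
    return "small" if heavy_atoms < 10 else "large"


def calc_unused(smiles):
    raise AssertionError("never requested, so never computed")


resolvers = {
    "heavy_atoms": calc_heavy_atoms,
    "size_class": calc_size_class,
    "unused": calc_unused,
}
row = ToyTable(resolvers, smiles="c1ccccc1O")
print(row.get("size_class"))  # asks for heavy_atoms, which asks for smiles
print(row.calls)              # "unused" does not appear here
```

Asking for "size_class" pulls in "heavy_atoms", which in turn needs only the 
input "smiles". The "unused" resolver never runs, which is the difference from 
a feed-forward pipeline.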

I have it configured to use the descriptors in rdkit.Chem.Descriptors, plus a 
few other properties. You can use the command-line tool to compute a set of 
properties, and save the result to a CSV file.

  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --dialect excel | head
  id,HeavyAtomCount,MolWt
  1688,21,319.191
  1963,24,359.216
  2118,22,308.772
  2802,22,315.716
  2809,22,314.728
  2997,19,270.719
  3016,20,284.746
  3261,21,294.745
  3299,25,360.772
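
If you want to post-process that output, here is a small sketch using Python's 
standard csv module. It assumes the output is ordinary comma-separated text 
that the csv module's own "excel" dialect can read; I've pasted in the first 
three rows from above rather than reading a real file.

```python
import csv
import io

# The header and first three rows of the output shown above.
text = """id,HeavyAtomCount,MolWt
1688,21,319.191
1963,24,359.216
2118,22,308.772
"""

rows = list(csv.DictReader(io.StringIO(text), dialect="excel"))

# csv gives back strings, so convert before comparing numerically.
heaviest = max(rows, key=lambda row: float(row["MolWt"]))
print(heaviest["id"], heaviest["MolWt"])
```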

I've also hooked it up to the NCI resolver, to convert a structure into an 
IUPAC name. Each request takes a short wait, so I've asked it to compute one 
record at a time, to get feedback while it's running:

  % ./rdprops tests/benzodiazepine.smi --columns 'id,cansmiles,nci_iupac_name' --batch-size 1
  id    cansmiles    nci_iupac_name
  1688    CN1C(=O)CN=C(c2ccc(Cl)cc2)c2cc(Cl)ccc21    7-chloro-5-(4-chlorophenyl)-1-methyl-3H-1,4-benzodiazepin-2-one
  1963    OCc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1Cl)=NC2    [8-chloro-6-(2-chlorophenyl)-4H-[1,2,4]triazolo[4,5-a][1,4]benzodiazepin-1-yl]methanol
  2118    Cc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1)=NC2    8-chloro-1-methyl-6-phenyl-4H-[1,2,4]triazolo[4,3-a][1,4]benzodiazepine
  2802    O=C1CN=C(c2ccccc2Cl)c2cc([N+](=O)[O-])ccc2N1    5-(2-chlorophenyl)-7-nitro-1,3-dihydro-1,4-benzodiazepin-2-one
  2809    O=C(O)C1N=C(c2ccccc2)c2cc(Cl)ccc2NC1=O    7-chloro-2-oxo-5-phenyl-1,3-dihydro-1,4-benzodiazepine-3-carboxylic acid
  2997    O=C1CN=C(c2ccccc2)c2cc(Cl)ccc2N1    7-chloro-5-phenyl-1,3-dihydro-1,4-benzodiazepin-2-one
  3016    CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21    7-chloro-1-methyl-5-phenyl-3H-1,4-benzodiazepin-2-one
  3261    Clc1ccc2c(c1)C(c1ccccc1)=NCc1nncn1-2    8-Chloro-6-phenyl-4H-[1,2,4]triazolo[4,3-a][1,4]benzodiazepine
  3299    CCOC(=O)C1N=C(c2ccccc2F)c2cc(Cl)ccc2NC1=O    ethyl 7-chloro-5-(2-fluorophenyl)-2-oxo-1,3-dihydro-1,4-benzodiazepine-3-carboxylate
  3369    CN1C(=O)CN=C(c2ccccc2F)c2cc(Cl)ccc21    7-chloro-5-(2-fluorophenyl)-1-methyl-3H-1,4-benzodiazepin-2-one
  3380    CN1C(=O)CN=C(c2ccccc2F)c2cc([N+](=O)[O-])ccc21    5-(2-fluorophenyl)-1-methyl-7-nitro-3H-1,4-benzodiazepin-2-one
   ....


There's also a Python API for defining resolvers and working with the tables. 
If you write your own resolver you can add it to 'rdprops' using the '-r' 
option:

  % cat model.py
  
  from propbox import calculate, collect_resolvers
  
  @calculate()
  def calc_model(MolWt, NumHDonors):
    return MolWt * 12.34 / (NumHDonors + 1)
  
  resolver = collect_resolvers()

This is a non-standard resolver, so I need to tell rdprops the path for how to 
load it::

  % ./rdprops --columns 'id,model' -r model.resolver tests/CHEMBL11862.sdf
  id    model
  CHEMBL11862   509.61732


The README is pretty extensive; see https://bitbucket.org/dalke/propbox/src

The need comes out of work I did for a couple of companies to help integrate a 
wide range of descriptor and model calculations. One example workflow might be:

  - the input is a SMILES string
  - turn the SMILES into a molecule
  - desalt it and standardize the charge model
  - use the clean molecule to compute logP,
      molecular weight, and a few other descriptors
  - use the descriptors to compute model-1,
      model-2, and model-3
  - use model-1, model-2, and model-3 to compute
      a consensus model

In propbox this would be managed by setting up names for each of the 
intermediate properties, and a set of resolvers for them. (A "property" in 
propbox is any Python object, including an RDKit molecule.)
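
As a rough illustration of what "setting up names" for that workflow means, 
here is the dependency graph written out in plain Python. All of the property 
names are hypothetical, as is my guess that every model uses both descriptors, 
and this uses the standard library's graphlib (Python 3.9+) rather than 
anything from propbox.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# property name -> the properties it is computed from (hypothetical names)
depends_on = {
    "mol": ["smiles"],
    "clean_mol": ["mol"],
    "logP": ["clean_mol"],
    "MolWt": ["clean_mol"],
    "model_1": ["logP", "MolWt"],
    "model_2": ["logP", "MolWt"],
    "model_3": ["logP", "MolWt"],
    "consensus": ["model_1", "model_2", "model_3"],
}


def needed_for(target):
    """Return every property that must exist before `target` can be computed,
    including `target` itself."""
    seen = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(depends_on.get(name, []))
    return seen


# A feed-forward system would run everything, in an order like this one.
order = list(TopologicalSorter(depends_on).static_order())
print(order)

# An on-demand system only touches what the requested column needs.
print(sorted(needed_for("MolWt")))
```

Asking only for "MolWt" never touches logP or any of the models, while asking 
for "consensus" pulls in the whole chain.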
 

Setting up names is hard. Nomenclature is a messy thing. People can't even 
decide between "MW", "Mw", "MolWt" and other short forms for "molecular 
weight". That's why propbox includes a module system to isolate incompatible 
names, as well as an alias system.
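
The basic idea behind an alias can be sketched in a few lines. This is only a 
toy, not propbox's alias or module API; the ALIASES table and canonical() 
helper are hypothetical, and the 319.191 value is the MolWt reported above for 
record 1688.

```python
# Hypothetical alias table: each alternate spelling maps to one canonical name.
ALIASES = {"MW": "MolWt", "Mw": "MolWt"}


def canonical(name):
    """Return the canonical column name; unknown names pass through unchanged."""
    return ALIASES.get(name, name)


values = {"MolWt": 319.191}
for spelling in ("MW", "Mw", "MolWt"):
    # Every spelling ends up reading the same stored column.
    print(spelling, values[canonical(spelling)])
```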

Give it a whirl and let me know what you think.


                                Andrew
                                [email protected]



