Hi all,
I spent the last couple of weeks working on a project related to molecular
property and model calculations. It's called 'propbox', and is available from
https://bitbucket.org/dalke/propbox .
There are two parts to it:
- a (sparse) table, where the rows are structures and the columns are
properties
- a resolver, which knows how to fill in columns.
When the resolver fills in a column, it can ask the column for other columns.
This in turn may trigger more resolvers, until eventually the system ends up
with the input structure, and can fill in everything else.
Another way to think of it is that each resolver is a set of Pipeline Pilot/KNIME
nodes, with inputs and outputs connected by name. But unlike those data flow
systems, which feed data forwards, propbox computes values on demand.
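As a rough illustration of the on-demand idea, here is a toy sketch in plain Python. It is not propbox's implementation or API. The names RESOLVERS, resolver, and get_value are made up for this example, and the "descriptor" is a stand-in that just counts letters in a SMILES string.

```python
import inspect

# Hypothetical registry mapping a column name to the function that computes it.
RESOLVERS = {}

def resolver(name):
    """Register a function as the way to compute column `name`."""
    def decorator(func):
        RESOLVERS[name] = func
        return func
    return decorator

@resolver("num_atoms")
def calc_num_atoms(smiles):
    # Toy stand-in for a descriptor: count the letters in the SMILES string.
    return sum(ch.isalpha() for ch in smiles)

@resolver("size_class")
def calc_size_class(num_atoms):
    return "small" if num_atoms < 10 else "large"

def get_value(row, name):
    """Return row[name], computing it (and its inputs) only when needed."""
    if name not in row:
        func = RESOLVERS[name]
        # The function's parameter names are the columns it depends on.
        deps = inspect.signature(func).parameters
        row[name] = func(*[get_value(row, dep) for dep in deps])
    return row[name]

row = {"smiles": "c1ccccc1O"}
print(get_value(row, "size_class"))
print(sorted(row))
```

Asking for "size_class" pulls in "num_atoms", which in turn reads the "smiles" value already in the row. Nothing is computed until something asks for it.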
I have it configured to use the descriptors in rdkit.Chem.Descriptors, plus a
few other properties. You can use the command-line tool to compute a set of
properties, and save the result to a CSV file.
% ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --dialect excel | head
id,HeavyAtomCount,MolWt
1688,21,319.191
1963,24,359.216
2118,22,308.772
2802,22,315.716
2809,22,314.728
2997,19,270.719
3016,20,284.746
3261,21,294.745
3299,25,360.772
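If you want to post-process that output in Python, a minimal sketch follows. It assumes the "excel" dialect in the command corresponds to the standard csv module's dialect of the same name, which seems likely but is a guess. The first three rows of the output are inlined so the example is self-contained.

```python
import csv
import io

# First rows of the output shown above, inlined for a self-contained example.
text = """id,HeavyAtomCount,MolWt
1688,21,319.191
1963,24,359.216
2118,22,308.772
"""

reader = csv.DictReader(io.StringIO(text), dialect="excel")
rows = [
    {
        "id": rec["id"],
        "HeavyAtomCount": int(rec["HeavyAtomCount"]),
        "MolWt": float(rec["MolWt"]),
    }
    for rec in reader
]

# Find the record with the largest molecular weight.
heaviest = max(rows, key=lambda rec: rec["MolWt"])
print(heaviest["id"], heaviest["MolWt"])
```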
I've also hooked it up to the NCI resolver, to convert a structure into an
IUPAC name. Each request takes a moment, so I've asked rdprops to process one
record at a time, which gives feedback while it's running:
% ./rdprops tests/benzodiazepine.smi --columns 'id,cansmiles,nci_iupac_name' --batch-size 1
id cansmiles nci_iupac_name
1688 CN1C(=O)CN=C(c2ccc(Cl)cc2)c2cc(Cl)ccc21 7-chloro-5-(4-chlorophenyl)-1-methyl-3H-1,4-benzodiazepin-2-one
1963 OCc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1Cl)=NC2 [8-chloro-6-(2-chlorophenyl)-4H-[1,2,4]triazolo[4,5-a][1,4]benzodiazepin-1-yl]methanol
2118 Cc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1)=NC2 8-chloro-1-methyl-6-phenyl-4H-[1,2,4]triazolo[4,3-a][1,4]benzodiazepine
2802 O=C1CN=C(c2ccccc2Cl)c2cc([N+](=O)[O-])ccc2N1 5-(2-chlorophenyl)-7-nitro-1,3-dihydro-1,4-benzodiazepin-2-one
2809 O=C(O)C1N=C(c2ccccc2)c2cc(Cl)ccc2NC1=O 7-chloro-2-oxo-5-phenyl-1,3-dihydro-1,4-benzodiazepine-3-carboxylic acid
2997 O=C1CN=C(c2ccccc2)c2cc(Cl)ccc2N1 7-chloro-5-phenyl-1,3-dihydro-1,4-benzodiazepin-2-one
3016 CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21 7-chloro-1-methyl-5-phenyl-3H-1,4-benzodiazepin-2-one
3261 Clc1ccc2c(c1)C(c1ccccc1)=NCc1nncn1-2 8-Chloro-6-phenyl-4H-[1,2,4]triazolo[4,3-a][1,4]benzodiazepine
3299 CCOC(=O)C1N=C(c2ccccc2F)c2cc(Cl)ccc2NC1=O ethyl 7-chloro-5-(2-fluorophenyl)-2-oxo-1,3-dihydro-1,4-benzodiazepine-3-carboxylate
3369 CN1C(=O)CN=C(c2ccccc2F)c2cc(Cl)ccc21 7-chloro-5-(2-fluorophenyl)-1-methyl-3H-1,4-benzodiazepin-2-one
3380 CN1C(=O)CN=C(c2ccccc2F)c2cc([N+](=O)[O-])ccc21 5-(2-fluorophenyl)-1-methyl-7-nitro-3H-1,4-benzodiazepin-2-one
....
There's also a Python API for defining resolvers and working with the tables.
If you write your own resolver you can add it to 'rdprops' using the '-r'
option:
% cat model.py
from propbox import calculate, collect_resolvers

@calculate()
def calc_model(MolWt, NumHDonors):
    return MolWt * 12.34 / (NumHDonors + 1)

resolver = collect_resolvers()
This is a non-standard resolver, so I need to tell rdprops where to load it
from:
% ./rdprops --columns 'id,model' -r model.resolver tests/CHEMBL11862.sdf
id model
CHEMBL11862 509.61732
The README is pretty extensive; see https://bitbucket.org/dalke/propbox/src
The need comes out of work I did for a couple of companies to help integrate a
wide range of descriptor and model calculations. One example workflow might be:
- the input is a SMILES string
- turn the SMILES into a molecule
- desalt it and standardize the charge model
- use the clean molecule to compute logP,
  molecular weight, and a few other descriptors
- use the descriptors to compute model-1,
model-2, and model-3
- use model-1, model-2, and model-3 to compute
a consensus model
In propbox this would be managed by setting up names for each of the
intermediate properties, and a set of resolvers for them. (A "property" in
propbox is any Python object, including an RDKit molecule.)
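The workflow above can be sketched as a table of named steps. This is only an illustration in plain Python, not propbox code. Every step name and every formula below is a hypothetical stand-in (string operations on the SMILES instead of real chemistry, invented arithmetic instead of real models). The point is that the three models share the same intermediate values, and each intermediate is computed once.

```python
# name -> (input names, function). All names and formulas are hypothetical.
STEPS = {
    # Stand-in for desalting: keep the longest dot-separated fragment.
    "clean_smiles": (["smiles"], lambda s: max(s.split("."), key=len)),
    # Stand-ins for real descriptors such as logP or molecular weight.
    "length": (["clean_smiles"], len),
    "hetero_count": (["clean_smiles"], lambda s: s.count("N") + s.count("O")),
    # Stand-ins for three models built on the descriptors.
    "model_1": (["length", "hetero_count"], lambda n, h: n + 2.0 * h),
    "model_2": (["length"], lambda n: 1.5 * n),
    "model_3": (["hetero_count"], lambda h: 10.0 * h),
    "consensus": (
        ["model_1", "model_2", "model_3"],
        lambda a, b, c: (a + b + c) / 3,
    ),
}

computed = []  # order in which steps actually ran

def resolve(row, name):
    """Fill in row[name] on demand, reusing values already in the row."""
    if name not in row:
        inputs, func = STEPS[name]
        row[name] = func(*[resolve(row, dep) for dep in inputs])
        computed.append(name)
    return row[name]

row = {"smiles": "CCO.Cl"}
consensus = resolve(row, "consensus")
print(consensus)
print(computed)
```

Asking only for "consensus" runs every upstream step exactly once, even though "length" and "hetero_count" each feed more than one model.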
Setting up names is hard. Nomenclature is a messy thing. People can't even
decide between "MW", "Mw", "MolWt" and other short forms for "molecular
weight". That's why propbox includes a module system to isolate incompatible
names, as well as an alias system.
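To show what an alias layer buys you, here is a minimal sketch. It is not how propbox's alias system works internally; ALIASES and canonical are hypothetical names used only for this example.

```python
# Hypothetical alias table: each alternate spelling maps to one canonical name.
ALIASES = {"MW": "MolWt", "Mw": "MolWt", "mol_weight": "MolWt"}

def canonical(name):
    """Translate an alias to its canonical column name; pass others through."""
    return ALIASES.get(name, name)

row = {"MolWt": 319.191}
for spelling in ("MW", "Mw", "MolWt"):
    # Every spelling ends up reading the same stored value.
    print(spelling, "->", row[canonical(spelling)])
```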
Give it a whirl and let me know what you think.
Andrew
[email protected]