[Administrivia Note: I'm dropping the devel list from this thread]

Hi Adrian,

On Thu, May 29, 2008 at 11:39 AM, Adrian Schreyer <[email protected]> wrote:
>
> I think a feature to handle PDB structures would be a great addition
> to RDKit, but I guess this depends on whether it's possible to have
> thousands of atoms in an rdmol object.

I can't think of any reason why this would be a problem. Sanitizing a
molecule with thousands of atoms might take a while, but that's a
process that could probably be short-circuited for a protein.

> Maybe I should start to explain what kind of implementation I had in
> mind. Since the hierarchy in PDB structures can be fragile sometimes,
> the most convenient way would be to use a set theory-like
> representation. Basically, there could be a class rdResidueAtom, subclass
> of rdAtom, which in addition to the inherited attributes and functions
> also has PDB details as attributes such as
>
> rdResidueAtom.IsHetatm
> rdResidueAtom.SerialNumber
> rdResidueAtom.AtomName
> rdResidueAtom.AlternateLocation
> rdResidueAtom.ResidueName
> rdResidueAtom.ChainID
> rdResidueAtom.ResidueNumber
> rdResidueAtom.InsertCode
> rdResidueAtom.Occupancy
> rdResidueAtom.BFactor
>
> the sets would be structure->NMR model->chain->residue->atom then. In
> an ideal case, it would be possible to access individual sets through
> a hierarchy/index object, e.g. -> getResidue(chainID, resName, resNum,
> insCode) would return all atoms with ->atomIdx() that belong to that
> set - and accordingly for chains, atoms etc.

Sounds good so far.

>
> It would be tremendously useful to have cheminformatics functions
> available for protein-ligand complexes, particularly to determine
> connectivity, aromaticity, assign implicit hydrogens, geometric
> functions etc.

I'm nodding my head.

> This, as a result, would simplify many tasks, for instance determining
> hydrogen bonding, analysing the geometry of pi-pi interactions and so
> on.

I still agree.

> That would make my life a lot easier at least! ;)

Indeed!

Now, after all that agreement, the problems.

There are four major concerns for me here:
1) The representation and manipulation of biomolecules isn't something
   I feel I know much about, so I'm not the one to add these kinds of
   features.
2) PDB is such a poorly documented format that I wouldn't really want
   to write a parser.
3) There's the whole problem of adding bonds to the ligands.
4) I think it would be better to have a "proper" data model.

Point 2 isn't that big of a deal: there are enough parsers out there.
A 90% solution for point 3 is probably not that hard; the last 10%
would be frustrating. Point 4 requires some new data structures; this
isn't in principle a problem, but my lack of expertise in the area
(point 1) means I can't really propose effective solutions.

I think a nice solution to this would be to use something like
Biopython to handle the biomolecules and add some functionality to
translate "cuts" from the biomolecule into one or more RDKit
molecules. This would allow you to transfer the ligand and its
neighborhood (however you choose to define that) into RDKit and do the
cheminformatics-type manipulations there.

Opinions?

-greg
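
To make the set-style lookup and the "cut" idea above a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration, not an RDKit or Biopython API: the names PdbAtom, parse_pdb_atoms, get_residue and cut_neighborhood are hypothetical, the 4.5 angstrom cutoff is an arbitrary choice, and the parser only reads the fixed-column ATOM/HETATM records. In practice an existing PDB parser would probably replace parse_pdb_atoms, and the selected atoms would still need to be turned into an RDKit molecule (with bonds assigned), which is not shown here.

```python
from dataclasses import dataclass
from math import dist  # Python 3.8+


@dataclass(frozen=True)
class PdbAtom:
    """One ATOM/HETATM record (hypothetical container, not an RDKit class)."""
    is_hetatm: bool
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain_id: str
    res_num: int
    ins_code: str
    xyz: tuple
    occupancy: float
    b_factor: float


def parse_pdb_atoms(text):
    """Read ATOM/HETATM lines using the fixed PDB column layout."""
    atoms = []
    for line in text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append(PdbAtom(
            is_hetatm=(record == "HETATM"),
            serial=int(line[6:11]),
            name=line[12:16].strip(),
            alt_loc=line[16].strip(),
            res_name=line[17:20].strip(),
            chain_id=line[21].strip(),
            res_num=int(line[22:26]),
            ins_code=line[26].strip(),
            xyz=(float(line[30:38]), float(line[38:46]), float(line[46:54])),
            occupancy=float(line[54:60]),
            b_factor=float(line[60:66]),
        ))
    return atoms


def get_residue(atoms, chain_id, res_name, res_num, ins_code=""):
    """Return the indices of all atoms belonging to one residue."""
    key = (chain_id, res_name, res_num, ins_code)
    return [i for i, a in enumerate(atoms)
            if (a.chain_id, a.res_name, a.res_num, a.ins_code) == key]


def cut_neighborhood(atoms, ligand_idx, cutoff=4.5):
    """Return indices of the ligand atoms plus every atom within
    `cutoff` angstroms of any ligand atom."""
    ligand = set(ligand_idx)
    ligand_xyz = [atoms[i].xyz for i in ligand]
    return [i for i, a in enumerate(atoms)
            if i in ligand
            or any(dist(a.xyz, p) <= cutoff for p in ligand_xyz)]


# Tiny made-up structure: two ALA atoms near a one-atom ligand, one far GLY atom.
SAMPLE = "\n".join([
    "ATOM      1  N   ALA A   1       0.000   0.000   0.000  1.00 10.00",
    "ATOM      2  CA  ALA A   1       1.500   0.000   0.000  1.00 10.00",
    "ATOM      3  N   GLY A   2      20.000   0.000   0.000  1.00 10.00",
    "HETATM    4  C1  LIG B   1       3.000   0.000   0.000  1.00 20.00",
])

if __name__ == "__main__":
    atoms = parse_pdb_atoms(SAMPLE)
    ligand = get_residue(atoms, "B", "LIG", 1)
    print(cut_neighborhood(atoms, ligand))
```

The lookups return plain atom indices, in the spirit of the getResidue(chainID, resName, resNum, insCode) idea quoted above, so the same indices could later be used to pick out the atoms to hand over to RDKit.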

