Re: [ccp4bb] PDB data mining

Thomas Holder Tue, 08 Mar 2011 23:51:41 -0800

Jason, that was pretty fast! You delivered the script before someoneelse could even suggest PyMOL ;)


If you use


    from chempy import cpv
    cpv.distance(aa.coord, bb.coord)

instead of

    cmd.get_distance( "index %s" % aa.index, "index %s" % bb.index)

it will be significantly faster.

Cheers,
  Thomas

On Mar 9, 2011, at 7:43 AM, Jason Vertrees wrote:

Hi Cale,

For any given structure in the PDB, I want to identify all theHistidine ND1atoms. I then want to consider these atoms in pairs, measure thedistancein Angstroms between the ND1 atoms in each pair, and compile thesedistances(along with residue numbers of the pair) in a table. I then wantto repeatthis procedure for each unique structure in the PDB and generate atable
containing all occurrences of HisND1 pairs with their corresponding
separation distance. Amongst other things, I want e.g. generate ahistogram
from this table and determine e.g. the shortest HisND1 pair distance
observed and the structure in which this happens. Does anyone haveanysuggestions for any tools I might be able to use to perform thissearch?


This looked like a fun little PyMOL task.  So, I wrote a script to do
it.  This will iterate over all the PDBs in a given directory and
calculate the non-redundant distances of all HIS ND1s.  A table is
written to pre_hist.csv.  Use R or something similar to create the
histogram and do the numerical analysis.

To use this, first, you'll need a prepared copy of the PDB, suitable
for this task, downloaded locally in pdb_path.  You will need to set
"pdb_path" in the script below to wherever your PDBs are.  (The other
kind folks on this list already pointed you to resources for mirroring
the PDB, if you haven't done it already.)  To run the script below
just save it to disk as "his.py" and launch it with:

# from Linux
pymol -cq his.py

# or, from Mac
/Applications/PyMOLX11Hybrid.app/Contents/MacOS/MacPyMOL -cq his.py

If you're a Linux/Mac user, when the script finishes, just type:

sort -n -k8  pre_hist.csv | head -2

into your BASH shell to get the shortest distance. For my examplePDB it was:


$ sort -n -k8  pre_hist.csv | head -2
PDB    CHAIN    RESI    ATOM-A    CHAIN    RESI    ATOM-B    DISTANCE
5rla   B        187     3729      C        312     7075      2.838208

To view the shortest distance use the ATOM-A and ATOM-B fields from
the output file.  From my example above I would need to fetch the
protein and create a distance from "index 3729" to "index 7075".
Here's how:

# fetch the protein
fetch 5rla, async=0

# show the distance
distance index 3729, index 7075

# zoom on the new distance
zoom dist01


Here's the Python script that does the work:

import glob, os, pymol, sys
from pymol import cmd

the_pdb="/Users/vertrees/small_pdb"
files = glob.glob(the_pdb+os.sep+"*.pdb")

if not len(files):

print "Please set 'the_pdb' variable to a valid path containingPDB files."

   sys.exit(1)
else:
   print "Processing %d files." % len(files)

s, outFile = "resn HIS and name ND1", "pre_hist.csv"

f = open(outFile, 'wb')
# write the header
f.write("PDB\tCHAIN\tRESI\tATOM-A\tCHAIN\tRESI\tATOM-B\tDISTANCE\n")
# for each file in the mirror
for x in files:
   cmd.load(x,finish=1)
   n = cmd.get_names()[0]
   m = cmd.get_model(s).atom
   # pairwise for each atom
   for aa in m:
       for bb in m:
           # avoid duplicates
           if aa.index==bb.index: continue
           f.write( "%s\t%s\t%s\t%s\t%s\t%s\t%d\t%f\n" %
                    (n, aa.chain, aa.resi, aa.index,
                        bb.chain, bb.resi, bb.index,
                        cmd.get_distance( "index %s" % aa.index,
"index %s" % bb.index)))
   cmd.delete(n)
f.close()

print "Processed %d files. Please see %s for results." %(len(files), outFile)


Cheers,

-- Jason

--
Jason Vertrees, PhD
PyMOL Product Manager
Schrodinger, LLC

(e) [email protected]
(o) +1 (603) 374-7120


--
Thomas Holder
MPI for Developmental Biology
Spemannstr. 35
D-72076 Tübingen

Re: [ccp4bb] PDB data mining

Reply via email to