Re: [ccp4bb] pdb sequence search
Hi Ed,

What about submitting the UniProt accession number of your protein to the PDB? As you know, this will list all the entries that contain your protein sequence.

-Vandu murugan

On 6/23/12, Ed Pozharski epozh...@umaryland.edu wrote:

Silly question. Say I want to find every structure in the PDB with the exact sequence or with perhaps 1-2 mutations. I know of two ways of doing this.

1. Go to NCBI BLAST and run the sequence against the PDB subset. The resulting list will have identities listed, so manual parsing is doable if there aren't too many hits.

2. PDB and PDBe both have search-by-sequence features. The trouble is that the default E value seems to be tailored to poor sequence identity (which makes sense if you are looking for potential MR models). Sure, I can reduce the target E value, but it's a little cumbersome and I have no idea what the target level should be so that I don't get any 50% identical sequences yet don't miss single/double mutants.

Wouldn't it be nice if one could use a sequence identity cutoff/query coverage instead? Much more comprehensible than the E-value. Is there a search engine that does that? Seems like a fairly common need, and perhaps I just can't find it on the PDB website.

Thanks in advance for any suggestions,
Ed.

--
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs
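For what it's worth, the identity/coverage filtering Ed asks for can be done by hand on BLAST output. The following is only a minimal sketch, assuming the hits were saved in BLAST+ tabular format (`-outfmt 6` with its default twelve columns) and that the query length is known. The function name `filter_hits` and the default thresholds are illustrative choices, not something from this thread.

```python
def filter_hits(tabular_text, query_length, min_identity=98.0, min_coverage=0.95):
    """Return subject IDs that pass identity and query-coverage cutoffs.

    Assumes BLAST+ tabular output with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    kept = []
    for line in tabular_text.splitlines():
        # Skip blank lines and comment lines (present with -outfmt 7).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        subject_id = fields[1]
        percent_identity = float(fields[2])
        query_start, query_end = int(fields[6]), int(fields[7])
        # Fraction of the query covered by this single alignment (HSP).
        coverage = (query_end - query_start + 1) / query_length
        if (percent_identity >= min_identity
                and coverage >= min_coverage
                and subject_id not in kept):
            kept.append(subject_id)
    return kept
```

The coverage check matters because a short fragment can be 100% identical over the aligned region while covering only a small part of the query. Note that this looks at each alignment separately and does not merge several HSPs against the same subject.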
Re: [ccp4bb] pdb sequence search
Hi Ed,

If you are looking for a specific protein, why not get all PDB files with a DBREF record pointing at the UniProt record of the protein you want? You can do a simple text search in the PDB, e.g. 'MYG_PHYCA'.

Cheers,
Robbie

Date: Fri, 22 Jun 2012 22:39:12 -0400
From: epozh...@umaryland.edu
Subject: Re: [ccp4bb] pdb sequence search
To: CCP4BB@JISCMAIL.AC.UK

Tim,

> I did not understand your objection against solution 1 - is it because it is not automated? You can sort the results by max. Ident so that you can scroll down to the limit you set yourself.

More that it does not generate a list of PDB IDs. What I want to do is to find every structure of a particular protein and line them all up. I am not saying it's not doable with option 1, it's just not too convenient.

> Why do you think an identity cut-off was a good criterion? I usually cut by E-value because I assume the developers of blast know what they are doing and I have the impression they consider the E-value a better criterion than the max. Ident.

Because I want all the structures of a particular protein itself, not its homologues. I just went through several cycles of reducing E-value down to 1e-100, and I still get one hit included at 88% identity. Setting E-value cutoff to 0 doesn't work, it just returns them all.

Well, thanks to you I now see how to figure out the cutoff - the results are sorted by E-value and list it, so I can just go to the first non-identical hit and use a slightly smaller number. It's just that sequence identity is easier for me to interpret and it's (emotionally) easier to select a cutoff at, say, no more than 5 mutations rather than E-value of 10e-150.

Cheers,
Ed.

--
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs
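If the PDB files are already on disk, Robbie's DBREF idea can also be applied locally. This is a rough sketch only: it splits DBREF lines on whitespace, which is a simplification of the fixed-column PDB format and will misread records with insertion codes or a blank chain ID. It also ignores the two-line DBREF1/DBREF2 form. The function name `chains_for_uniprot_id` is made up for illustration.

```python
def chains_for_uniprot_id(pdb_text, db_id_code):
    """Return (pdb_id, chain, accession) for DBREF records naming db_id_code.

    Simplification: fields are taken by whitespace splitting, in the order
    DBREF idCode chainID seqBegin seqEnd database dbAccession dbIdCode ...
    """
    matches = []
    for line in pdb_text.splitlines():
        # The trailing space excludes DBREF1/DBREF2 continuation records.
        if not line.startswith("DBREF "):
            continue
        fields = line.split()
        if len(fields) < 8:
            continue
        pdb_id, chain = fields[1], fields[2]
        database, accession, id_code = fields[5], fields[6], fields[7]
        # "UNP" is the database tag used for UniProt references.
        if database == "UNP" and id_code == db_id_code:
            matches.append((pdb_id, chain, accession))
    return matches
```

Running this over each file's text with `db_id_code="MYG_PHYCA"` would give the same kind of list as the text search, one tuple per matching chain.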
Re: [ccp4bb] pdb sequence search
Hi,

The up-to-date list of mappings between the PDB and the UniProt sequence database is available at:

ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/csv/pdb_chain_uniprot.csv

This file maps PDB chains to UniProt accession numbers, which lets you find all PDB entries for a particular UniProt accession number.

To answer the original question about sequence search: the PDBe service at pdbe.org/fasta lets you set a % identity value and search against PDB sequences.

cheers,
Sameer Velankar
PDBe
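Once the SIFTS mapping file mentioned in Sameer's message has been downloaded, pulling out the chains for one accession takes a few lines. This is a sketch under assumptions: it assumes the file starts with a `#` comment line followed by a header row containing columns named `PDB`, `CHAIN` and `SP_PRIMARY`, so check the header of your copy before relying on it. The function name `pdb_chains_for_accession` is hypothetical.

```python
import csv


def pdb_chains_for_accession(csv_lines, accession):
    """Return sorted, de-duplicated (pdb_id, chain) pairs for one UniProt accession.

    csv_lines is any iterable of text lines, e.g. an open file handle.
    Assumed column names: PDB, CHAIN, SP_PRIMARY.
    """
    # Drop comment lines so the first remaining line is the header row.
    rows = (line for line in csv_lines if not line.startswith("#"))
    reader = csv.DictReader(rows)
    pairs = {
        (row["PDB"], row["CHAIN"])
        for row in reader
        if row["SP_PRIMARY"] == accession
    }
    return sorted(pairs)
```

Typical use would be `with open("pdb_chain_uniprot.csv") as handle: pdb_chains_for_accession(handle, accession)`, where `accession` is the UniProt accession of your protein. A chain can appear on several rows when it maps in segments, which is why the result is de-duplicated.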
Re: [ccp4bb] pdb sequence search
> Because I want all the structures of a particular protein itself, not its homologues. I just went through several cycles of reducing E-value down to

If you know the UniProt accession code of your protein, then UniPDB is your friend: pdbe.org/unipdb

If not, try pdbe.org/fasta, where you can supply the sequence and the percentage sequence identity (SI) cut-off.

--Gerard
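To pick the percentage for an identity cut-off from a "no more than N mutations" rule, as discussed earlier in the thread, the arithmetic is simple. The sketch below assumes the mutations are substitutions only and that identity is computed over the full query length; insertions, deletions, tags or truncated constructs would need a looser value. The function name `identity_cutoff` is just for illustration.

```python
def identity_cutoff(sequence_length, max_mutations):
    """Lowest percent identity still allowed with max_mutations substitutions.

    Assumes identity = identical positions / full sequence length.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 0 <= max_mutations <= sequence_length:
        raise ValueError("max_mutations must be between 0 and sequence_length")
    return 100.0 * (sequence_length - max_mutations) / sequence_length
```

For a 150-residue protein and up to 5 substitutions this comes out at roughly 96.7%, so the cut-off depends on the length of the protein and should be recomputed for each query.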