On 30/12/13 22:45, Oliver Clarke wrote:
Great! Next scripting question that I’m puzzling over - how to take a set of
sequences (present in a single fasta file), in arbitrary order, match and align
them to their corresponding homologous subunits in a PDB file, and use
align_and_mutate to correct the sequence of each subunit (and delete the gaps).
Right now, I’m stuck on the first step - how to take a single sequence from a
fasta file and output the matching chain ID.
I can use:
import subprocess
align_to_closest_chain("%s"%(subprocess.check_output(["tail","+2","test.fasta"])).upper(),
0.95)
FWIW, I think that using "tail" is ugly and you should read the contents
of the fasta file into a list of strings.
def file_name_to_string_list(file_name):
if os.path.isfile(file_name):
f = open(file_name, 'r')
return f.readlines()
else:
return []
Maybe there is a python built-in for this already? :)
Then use (say):
fasta_seqs = file_name_to_string_list('all.fasta')
align_to_closest_chain(fasta_seqs[2].upper(), 0.95)
to check for a matching chain, but the output of align_to_closest_chain() is
either 0 or 1.
The matching chain id is present in the console output, but I haven’t yet
figured out how to parse that and do anything useful with it - the console
output does not seem be easily accessible via either stdout or stderr (was
hoping I could redirect the console output to a file and grep it for the chain
id - ugly but I don’t know a better way).
OK, that's a good point.
So the policy is that interesting variables/values should be returned by
functions. In this case it would be good to know the selected chain-id,
but you don't get that - so that's a problem. I'll fix that now. In
future this function will return False or the molecule number and chain-id.
Also it is not immediately clear to me how the match_fraction parameter relates
to sequence identity or alignment score for a given match (The alignment also
seems to be case sensitive, which doesn’t seem right, hence the .upper() in the
above).
Yes, I think the input sequence for coot is case sensitive (and it
should not be).
IIRC, the alignment score has be at least as close as (as good as)
match_fraction, otherwise it is not performed.
What am i doing wrong - thoughts/suggestions? Would it be better to do the
sequence handling steps externally?
I'll fix it and update later.
Paul.