On 30/12/13 22:45, Oliver Clarke wrote:
Great! Next scripting question that I’m puzzling over - how to take a set of 
sequences (present in a single fasta file), in arbitrary order, match and align 
them to their corresponding homologous subunits in a PDB file, and use 
align_and_mutate to correct the sequence of each subunit (and delete the gaps).

Right now, I’m stuck on the first step - how to take a single sequence from a 
fasta file and output the matching chain ID.

I can use:

import subprocess
align_to_closest_chain("%s"%(subprocess.check_output(["tail","+2","test.fasta"])).upper(),
 0.95)

FWIW, I think that using "tail" is ugly and you should read the contents of the fasta file into a list of strings.

def file_name_to_string_list(file_name):
    if os.path.isfile(file_name):
        f = open(file_name, 'r')
        return f.readlines()
    else:
        return []

Maybe there is a python built-in for this already? :)

Then use (say):

fasta_seqs = file_name_to_string_list('all.fasta')
align_to_closest_chain(fasta_seqs[2].upper(), 0.95)


to check for a matching chain, but the output of align_to_closest_chain() is 
either 0 or 1.

The matching chain id is present in the console output, but I haven’t yet 
figured out how to parse that and do anything useful with it - the console 
output does not seem be easily accessible via either stdout or stderr (was 
hoping I could redirect the console output to a file and grep it for the chain 
id - ugly but I don’t know a better way).

OK, that's a good point.

So the policy is that interesting variables/values should be returned by functions. In this case it would be good to know the selected chain-id, but you don't get that - so that's a problem. I'll fix that now. In future this function will return False or the molecule number and chain-id.


Also it is not immediately clear to me how the match_fraction parameter relates 
to sequence identity or alignment score for a given match (The alignment also 
seems to be case sensitive, which doesn’t seem right, hence the .upper() in the 
above).

Yes, I think the input sequence for coot is case sensitive (and it should not be).

IIRC, the alignment score has be at least as close as (as good as) match_fraction, otherwise it is not performed.


What am i doing wrong - thoughts/suggestions? Would it be better to do the 
sequence handling steps externally?

I'll fix it and update later.

Paul.

Reply via email to