A quick background overview. I have a relatively simple tool for generating six-frame translations of a genome (all possible protein segments encoded in a genome). I read through the input one codon (3 DNA bases) at a time, keeping the previous codon, and build the sequence fragments in six separate scalars. At every codon I check whether any of the fragments ends in a 'stop' codon; if so, I call a function that checks whether the fragment is >= the minimum length and, if it is, formats the output and returns it. The scalar containing the fragment is passed by reference and 'cleared' ($$sref = '') in the stop function. The formatted output, if any, is accumulated in a temp variable and, if anything was returned, the contents are printed.

The original way I handled this caused, seemingly at random, fragments to be output repeatedly over a number of iterations. After a small change the problem went away, although I would have thought the change to be equivalent. Here's a stripped-down version of the first method:
my ($seq1,$cseq1,...) = '';
while($codon = &$read_code) {
    my $six = $last . $codon;
    $last = $codon;
    $pos += 3;
    $seq1 .= $AA{$codon} || "X";
    complement(\$codon);
    $cseq1 .= $AA{$codon} || "X";
    # etc for other two frames
    my $out = _stop(\$seq1,$pos,$minlen,$id) if $seq1 =~ /\.$/;
    $out .= _stop(\$cseq1,$pos,$minlen,$id,$dna_len)
        if $cseq1 =~ /\.$/;
    # etc for all frames
    print $fh_out $out if $out;
} # while reading codons from genome
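The complement() helper is not shown in the post. The following is only a minimal sketch of what an in-place helper with that calling convention might look like, assuming it reverse-complements the codon through the reference it is given (whether the real helper also reverses the bases is an assumption):

```perl
use strict;
use warnings;

# Hypothetical in-place helper: takes a reference to a codon string and
# replaces it with its reverse complement (assumed behaviour).
sub complement {
    my ($sref) = @_;
    $$sref = reverse $$sref;            # scalar context: reverses the string
    $$sref =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its partner
    return;
}

my $codon = 'ATG';
complement(\$codon);
print "$codon\n";
```

Because the codon is modified through the reference, the caller's $codon holds the complemented value afterwards, which is what the second $AA{$codon} lookup in the loop relies on.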
Key lines in _stop:
if ($len < $min) { undef($$sref_seq);return '' }
$ret = "$id\t$$sref_seq\t$start\t$end\n";
$$sref_seq = '';
return $ret;
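For context, those key lines could sit in a complete sub along the following lines. This is a sketch only: the length calculation and the start/end arithmetic are assumptions made for illustration, the optional $dna_len argument used for the complement strand is ignored, and the short-fragment branch clears the buffer with '' rather than undef.

```perl
use strict;
use warnings;

# Sketch of _stop. Only the clearing and formatting lines come from the
# post; $len, $start and $end are computed in an assumed way.
sub _stop {
    my ($sref_seq, $pos, $min, $id) = @_;
    my $len = length $$sref_seq;        # assumed: residues incl. trailing '.'
    if ($len < $min) {
        $$sref_seq = '';                # clear the caller's buffer
        return '';
    }
    my $end   = $pos;                   # assumed: last base read so far
    my $start = $pos - 3 * $len + 1;    # assumed: first base of the fragment
    my $ret = "$id\t$$sref_seq\t$start\t$end\n";
    $$sref_seq = '';                    # clear the caller's buffer
    return $ret;
}

my $seq  = 'MKV.';
my $line = _stop(\$seq, 12, 3, 'SixFrame');
print $line;
```

Either branch leaves the caller's scalar empty, so the next codon starts a fresh fragment.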
The first version of the loop above causes the same fragment to be output
multiple times. Some debugging showed that _stop is only being called when
expected and that $seq1 etc. are being set to the empty string as intended.
I also determined that the repeat was being printed in successive
iterations of the while loop, and only as long as a return from _stop
didn't assign a new value other than '' to $out. For example, if I output
a simple loop counter, then I would see something like:
141:
SixFrame <sequence> 278 141
142:
SixFrame <sequence> 278 141
143:
SixFrame <sequence> 278 141
144:
145:
146:
etc. The 'SixFrame' line is the content of $out, and only the output on
iteration 141 was expected. There doesn't seem to be any obvious pattern
to how many times the output is repeated, but I didn't investigate that
deeply. Changing the "my $out = _stop" line in the while loop above to:
my $out = '';
$out .= _stop(\$seq1,$pos,$minlen,$id) if $seq1 =~ /\.$/;
seems to have completely solved the problem. Is this some sort of
mistake on my part, some subtle/odd behavior that makes this expected
in this usage (and if so, please explain), or should I report
this as a bug? [No, I haven't trawled the known bugs/fixes yet, sorry.]
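For what it's worth, the perlsyn documentation says that the behaviour of a my declaration combined with a statement modifier (as in "my $x = ... if COND") is undefined: the variable may be undef, may keep a previously assigned value, or may be anything else. That would be consistent with $out carrying its old contents into later iterations whenever the condition is false. The sketch below shows only the safe pattern, with the declaration unconditional and just the append conditional. The maybe_fragment sub and the loop bounds are hypothetical stand-ins for _stop and the codon loop.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _stop: returns one formatted line.
sub maybe_fragment {
    my ($i) = @_;
    return "fragment at $i\n";
}

my @printed;
for my $i (1 .. 6) {
    # Declare and initialise unconditionally, so every iteration
    # starts with a fresh, empty $out.
    my $out = '';

    # Only the append is conditional, never the declaration.
    $out .= maybe_fragment($i) if $i == 2;

    push @printed, $out if $out;
}
print @printed;
```

With the declaration separated from the condition, nothing from iteration 2 can reach iterations 3 to 6, so the fragment is collected exactly once.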
285: 11:45am % uname -a
Linux xxxx.xxx.xxx 2.4.20-43.9.legacysmp #1 SMP Tue Apr 26 08:08:36 EDT
2005 i686 athlon i386 GNU/Linux
286: 11:45am % perl -v
This is perl, v5.8.6 built for i686-linux-thread-multi
...
Thanks!
--
Sean Quinlan <[EMAIL PROTECTED]>
_______________________________________________ Boston-pm mailing list [email protected] http://mail.pm.org/mailman/listinfo/boston-pm

