Dear People,

        I am studying some topics on Bioinformatics and since I read
        a lot about Perl being employed in the area, I tried
        programming a very basic algorithm with it, the pair-wise
        sequence alignment algorithm.

        The problem is that while the very same algorithm runs quite
        satisfactorily with the same input data, the algorithm that I
        coded in Perl is about 40 times slower (I didn't test it with
        bigger instances, unfortunately).

        Since I am a newbie Perl programmer, I may be doing something
        utterly wrong in my program, some big "no-nos" and I would
        appreciate if somebody could point me to problems with my
        code, which I am sending below.

        In my code, I tried using only integers to make computations
        (so that it could run a bit faster) and to disable warning,
        but while the former did have a slight impact in the running
        time of my program, the latter didn't do anything noticeable.

        I also tried to "pre-allocate" (i.e., pre-extend) the arrays
        that I am using, with the hope that it would avoid Perl doing
        lots of reallocs() (if Perl uses it at all -- I'm guessing
        here), but the time taken also didn't change (or the change
        was negligible).

        Here is the program. As the input, it takes two sequences that
        it tries to "align" and calculate their edit distance (i.e.,
        the least amount of inserts, deletes, and character
        substitutions needed to change the first sequence into the
        second sequence).

        Please disregard the comments (in Portuguese, intended for my
        advisor).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#!/usr/bin/perl -w

use strict;
use integer;

sub p($$)
{
    return ($_[0] eq $_[1])?0:1;
}

sub min($$$)
{
    my ($min);

    $min = ($_[0] <= $_[1])?$_[0]:$_[1];
    $min = ($min  <= $_[2])?$min :$_[2];

    return $min;
}

my ($debug) = (0);
my ($s, $t, @s, @t); # sequencias a alinhar;

open(ENTRA, "seqs.txt");
chomp($s = <ENTRA>); #colocar verificacao de erros
chomp($t = <ENTRA>); #colocar entrada de acordo com o padrao FASTA
close(ENTRA);

# truque para facilitar acesso rapido a caracteres de $s e $t
# tambem elimina brancos indesejados -- posicao 0 inutilizada em $s e $t
@s = ("", split(/ */, $s));
@t = ("", split(/ */, $t));

my ($m, $n) = (scalar(@s)-1, scalar(@t)-1);

my ($i, $j);

# "aloca" matriz @a para evitar que o Perl faca realocacoes
my (@a); # matriz de prog. dinamica
$#a = $m; # linhas de 0 a $m

for ($i = 0; $i <= $m; $i++) {
    $a[$i]->[$n] = 0; # equivale a $a[$i] = malloc($n+1)
}

# inicializa matriz @a
for ($i = 0; $i <= $m; $i++) {
    $a[$i][0] = $i;
}
for ($j = 0; $j <= $n; $j++) {
    $a[0][$j] = $j;
}

# preenche a matriz @a
for ($i = 1; $i <= $m; $i++) {
    for ($j = 1; $j <= $n; $j++) {
        $a[$i][$j] = min($a[$i-1][$j] + p($s[$i], "-"),
                         $a[$i-1][$j-1] + p($s[$i], $t[$j]),
                         $a[$i][$j-1] + p("-", $t[$j]));
    }
}

if ($debug) {
    for ($i = 0; $i <= $m; $i++) {
        for ($j = 0; $j <= $n; $j++) {
            print "$a[$i][$j] ";
        }
        print "\n";
    }
}

# imprime o alinhamento
$i = $m; $j = $n;
my (@alin_s, @alin_t, @alin_c);

while ($i > 0 and $j > 0) {
    if ($a[$i][$j] == $a[$i-1][$j] + p($s[$i], "-")) {
        push(@alin_s, $s[$i]);
        push(@alin_t, "-");
        push(@alin_c, " ");
        $i--;
    }
    elsif ($a[$i][$j] == $a[$i-1][$j-1] + p($s[$i], $t[$j])) {
        push(@alin_s, $s[$i]);
        push(@alin_t, $t[$j]);
        push(@alin_c, ($s[$i] eq $t[$j])?"*":" ");
        $i--; $j--;
    }
    else {
        push(@alin_s, "-");
        push(@alin_t, $t[$j]);
        push(@alin_c, " ");
        $j--;
    }
}

while ($j > 0) { # seq. $s acabou
    push(@alin_s, "-");
    push(@alin_t, $t[$j]);
    push(@alin_c, " ");
    $j--;
}

while ($i > 0) { # seq. $t acabou
    push(@alin_s, $s[$i]);
    push(@alin_t, "-");
    push(@alin_c, " ");
    $i--;
}

print reverse(@alin_s), "\n";
print reverse(@alin_t), "\n";
print reverse(@alin_c), "\n";
print "dist(s, t) = $a[$m][$n]\n";
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        So, any help will be more than welcome.


        Thanks in advance, Roger...


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Rogério Brito - [EMAIL PROTECTED] - http://www.ime.usp.br/~rbrito/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to