Thanks Andrew for your input!

But the script still gives me only the total number of times they appear in the whole text. What I need now is the result for each individual block, something like this:

input file:
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498


Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185

output:

Contig 3772

CR05-C1-102 6 CR05-C1-103 1

Contig 3773

CR05-C1-102 3 CR05-C1-103 1

I believe it is not very complicated to do, but I'm just not able to do it by myself...

Marco Takita

On Jan 17, 2005, at 5:34 PM, Andrew Mace wrote:

Why not something like:

my %sequences = ();
my $seq;

while(<>) {
        if($_ =~ m/^Sequence ([^\n]+)$/) {
                $seq = $1;
                $sequences{$1} = [0,0];
        } elsif($_ =~ m/CR05-C1-10(\d)/) {
                if($1 == 2) {
                        $sequences{$seq}->[0]++;
                } elsif($1 == 3) {
                        $sequences{$seq}->[1]++;
                }
        }
}

my $total_102 = 0;
my $total_103 = 0;
for(keys %sequences) {
        print $_, ": 102 = ", $sequences{$_}->[0], "; 103 = ", $sequences{$_}->[1], "\n";
        $total_102 += $sequences{$_}->[0];
        $total_103 += $sequences{$_}->[1];
}


print "Total 102 = ", $total_102, "\n";
print "Total 103 = ", $total_103, "\n";
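
A variant of the script above, offered only as a rough sketch: it keeps the contigs in the order they appear in the file (a plain hash does not preserve order) and prints them in the layout requested at the top of this message. The sub name count_by_contig and the in-memory sample are made up for illustration; the sketch assumes every block starts with a line beginning with "Sequence" and that only the 102 and 103 libraries need counting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count reads per contig. Returns a list of [name, count_102, count_103]
# array refs, in the order the contigs appear in the input.
# (count_by_contig is a hypothetical helper name, not from the thread.)
sub count_by_contig {
    my ($fh) = @_;
    my @contigs;
    while (my $line = <$fh>) {
        # A "Sequence" header starts a new block.
        if ($line =~ /^Sequence\s+(\S+)/) {
            push @contigs, [$1, 0, 0];
        }
        next unless @contigs;    # skip any reads before the first header
        # Count every read name on the line; a header line may carry reads too.
        while ($line =~ /CR05-C1-10([23])/g) {
            $contigs[-1][ $1 == 2 ? 1 : 2 ]++;
        }
    }
    return @contigs;
}

# Small in-memory sample in the same format as the thread (illustrative only).
my $sample = <<'END';
Sequence Contig1
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400

Sequence Contig2
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
for my $c (count_by_contig($fh)) {
    my ($name, $n102, $n103) = @$c;
    print "$name\n\nCR05-C1-102 $n102 CR05-C1-103 $n103\n\n";
}
close $fh;
```

To run it on a real file, the in-memory handle could be replaced by one opened on the file, or by \*STDIN.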


Andrew




On Jan 17, 2005, at 2:04 PM, Marco Takita wrote:

Hi guys, sorry for a question not directly related to Mac OS X, but this is the OS I work with and I know that you guys are really helpful.

I'm really new to Perl. Actually, I'm trying to write my very first script. Let me try to explain what I need. I have a large text file that is basically something like this:

Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498


Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185

Sequence  Contig3774

and so on.

What I need is to count how many times either CR05-C1-102 or CR05-C1-103 appears in the text, which I was able to do:

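#!/usr/bin/perl
use strict;
use warnings;

# Prefixes to count; quoted so Perl does not read the hyphens as subtraction.
my @text = ('CR05-C1-102', 'CR05-C1-103');
my ($score, $res) = (0, 0);

while (<>) {
    chomp;
    foreach my $wd (split) {
        # Count each whitespace-separated word that contains a prefix.
        $score++ if $wd =~ /\Q$text[0]\E/;
        $res++   if $wd =~ /\Q$text[1]\E/;
    }
}

print " CR05-C1-102 $score CR05-C1-103 $res \n\n";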

My problem is that I cannot do that for individual blocks like:

Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498

I was not able to isolate this block from the rest of the text.

Any idea how to do that?
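
One possible way to isolate each block, sketched here under the assumption that blocks are always separated by at least one blank line (as in the sample above), is Perl's paragraph mode: setting $/ to the empty string makes each read return one whole blank-line-separated block. The helper name read_blocks and the sample data are hypothetical. If a "Sequence" header ever follows the previous block without a blank line, the two blocks would be merged, so this is not a general solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the input as a list of blank-line-separated blocks.
# (read_blocks is a hypothetical helper name, not from the thread.)
sub read_blocks {
    my ($fh) = @_;
    local $/ = "";         # paragraph mode: one record per block
    my @blocks = <$fh>;
    chomp @blocks;         # in paragraph mode, chomp strips all trailing newlines
    return @blocks;
}

# Illustrative sample in the same format as the thread.
my $sample = <<'END';
Sequence Contig1
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400


Sequence Contig2
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
for my $block (read_blocks($fh)) {
    # Skip anything that is not a Sequence block.
    my ($name) = $block =~ /^Sequence\s+(\S+)/ or next;
    # The "= () =" idiom counts the matches of a /g regex.
    my $n102 = () = $block =~ /CR05-C1-102/g;
    my $n103 = () = $block =~ /CR05-C1-103/g;
    print "$name\n\nCR05-C1-102 $n102 CR05-C1-103 $n103\n\n";
}
close $fh;
```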

Thanks a lot

Dr. Marco Aurélio Takita, Ph.D.
Centro APTA Citros Sylvio Moreira
Rodovia Anhanguera Km 158
Caixa Postal 04
13490-970 Cordeirópolis - SP, BRAZIL
Tel.: 55-19-35461399



