I don't have time to work out the details, but if I were faced
with this problem, I'd use a hash of hashes to store the blocks, with
the outer keys set to the block names and the inner keys set to
the CR05-C1-102 / CR05-C1-103 prefixes.

Use a regular expression to look for the string "Sequence" followed
by the block name, and store that name in a scalar while you count.
Initialize an anonymous hash to hold the counts, keyed by the strings
you need to search for, and increment the counts as you scan. When
you hit the next occurrence of "Sequence", store the anonymous hash
as the value in the main hash under the saved name and create a new
anonymous hash. Do the same once more at the end of the file so the
last block is not lost.
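
A minimal sketch of the hash-of-hashes idea, not a worked-out solution. The helper name count_by_block is made up for illustration, the sample lines are a subset of the input quoted further down, and the sketch assumes the strings of interest always look like CR05-C1- followed by three digits. It stores the inner anonymous hash in the outer hash as soon as a "Sequence" line is seen, which avoids a separate end-of-file step.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the input lines and returns a reference to a
# hash of hashes of the form { block name => { prefix => count } }.
sub count_by_block {
    my @lines = @_;
    my (%blocks, $current);
    for my $line (@lines) {
        # A "Sequence" line starts a new block with a fresh anonymous hash.
        if ($line =~ /^Sequence\s+(\S+)/) {
            $current = $1;
            $blocks{$current} = {};
        }
        # Ignore anything that appears before the first "Sequence" line.
        next unless defined $current;
        # Assumption: the prefix is "CR05-C1-" plus exactly three digits.
        # The /g loop counts every occurrence on the line.
        while ($line =~ /\b(CR05-C1-\d{3})/g) {
            $blocks{$current}{$1}++;
        }
    }
    return \%blocks;
}

# A few lines taken from the sample input, just to exercise the helper.
my @sample = (
    "Sequence Contig3772\n",
    "Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955\n",
    "Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400\n",
    "Sequence Contig3773\n",
    "Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289\n",
);

my $blocks = count_by_block(@sample);
for my $name (sort keys %$blocks) {
    print "$name\n";
    for my $prefix (sort keys %{ $blocks->{$name} }) {
        print "$prefix $blocks->{$name}{$prefix}\n";
    }
    print "\n";
}
```

To run it on a real file, replacing @sample with count_by_block(<>) and passing the file name on the command line should work, since <> in list context returns all the lines at once.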

If you don't know about hashes of hashes and anonymous hashes, read
(and study) chapter 4 of the Camel book.


Kim


On Jan 17, 2005, at 12:05 PM, Marco Takita wrote:

Thanks Andrew for your input!

But the script still gives me the result for the total number of times they appear in the text. What I need now is to get the results for individual blocks, something like this:

input file
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498


Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185

output:

Contig 3772

CR05-C1-102 6 CR05-C1-103 1

Contig 3773

CR05-C1-102 3 CR05-C1-103 1

I believe that it is not very complicated to do, but I'm just not able to do it by myself...

Marco Takita

On Jan 17, 2005, at 5:34 PM, Andrew Mace wrote:

Why not something like:

my %sequences = ();
my $seq;

while(<>) {
        if($_ =~ m/^Sequence\s+(\S+)/) {
                $seq = $1;
                $sequences{$seq} = [0,0];
        } elsif($_ =~ m/CR05-C1-10(\d)/) {
                if($1 == 2) {
                        $sequences{$seq}->[0]++;
                } elsif($1 == 3) {
                        $sequences{$seq}->[1]++;
                }
        }
}

my $total_102 = 0;
my $total_103 = 0;
for(keys %sequences) {
        print $_, ": 102 = ", $sequences{$_}->[0], "; 103 = ", $sequences{$_}->[1], "\n";
        $total_102 += $sequences{$_}->[0];
        $total_103 += $sequences{$_}->[1];
}


print "Total 102 = ", $total_102, "\n";
print "Total 103 = ", $total_103, "\n";


Andrew




On Jan 17, 2005, at 2:04 PM, Marco Takita wrote:

Hi guys, sorry for the question not directly related to macosx but this is the OS I work with and I know that you guys are really helpful.

I'm really new to Perl. Actually, I'm trying to write my very first script. Let me try to explain what I need. I have a large text file that is basically something like this:

Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498


Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185

Sequence  Contig3774

and so on.

What I need is to count how many times either CR05-C1-102 or CR05-C1-103 appears in the text, which I was able to do:

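#!/usr/bin/perl
use strict;
use warnings;

# The prefixes must be quoted; as barewords they are parsed as subtraction.
my @text = ('CR05-C1-102', 'CR05-C1-103');
my ($score, $res) = (0, 0);

while (<>) {
    chomp;
    # Check each whitespace-separated word against both prefixes.
    foreach my $wd (split) {
        $score++ if $wd =~ /\Q$text[0]\E/;
        $res++   if $wd =~ /\Q$text[1]\E/;
    }
}

print " CR05-C1-102 $score CR05-C1-103 $res \n\n";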

My problem is that I cannot do that for individual blocks like:

Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498

I was not able to isolate this block from the rest of the text.

Any idea how to do that?

Thanks a lot

Dr. Marco Aurélio Takita, Ph.D.
Centro APTA Citros Sylvio Moreira
Rodovia Anhanguera Km 158
Caixa Postal 04
13490-970 Cordeirópolis - SP, BRAZIL
Tel.: 55-19-35461399




