I don't have time to work out the details, but if I were faced
with this problem, I'd use a hash of hashes to store the blocks, with
the outer keys set to the block names and the inner keys set to
the CR05-... prefixes (CR05-C1-102, CR05-C1-103, or whatever).
Use a regular expression to look for the string "Sequence" followed
by the block name, and store that name in a scalar until the block's
counts are complete. Initialize an anonymous hash to hold the counts,
keyed by whatever strings you need to search for, and increment the
counts as you scan. When you hit the next occurrence of "Sequence,"
store the anonymous hash as the value in the main hash under the saved
name and create a new anonymous hash. Do the same at end of file so
the last block is not lost.
If you don't know about hashes of hashes and anonymous hashes, read
(and study) chapter 4 of the Camel book.
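The approach described above might look roughly like the sketch below. It is an illustration under assumptions, not a worked-out solution: the sub name count_by_block is made up, the inner key is assumed to be the "CR05-C1-" prefix plus the digits that follow it, and the sample data is a shortened copy of the input from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads from a filehandle and returns a reference to
# a hash of hashes, { block name => { prefix => count } }.
sub count_by_block {
    my ($fh) = @_;
    my %blocks;    # outer hash, keyed by block name
    my $name;      # name of the block currently being read
    my $counts;    # anonymous hash holding the current block's counts

    while (my $line = <$fh>) {
        if ($line =~ /^Sequence\s+(\S+)/) {
            # A new block starts: store the finished one, then start afresh.
            $blocks{$name} = $counts if defined $name;
            ($name, $counts) = ($1, {});
        }
        elsif (defined $name && $line =~ /^Assembled_from\s+(CR05-C1-\d+)/) {
            # Assumption: the inner key is the prefix, e.g. CR05-C1-102.
            $counts->{$1}++;
        }
    }
    # The last block has no following "Sequence" line, so store it here.
    $blocks{$name} = $counts if defined $name;
    return \%blocks;
}

# Shortened sample input, read through an in-memory filehandle.
my $sample = <<'END';
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Sequence Contig3773
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $blocks = count_by_block($fh);
close $fh;

# Print the block name, then its prefixes and counts on the next line.
for my $block (sort keys %$blocks) {
    print "$block\n";
    print join(' ', map { "$_ $blocks->{$block}{$_}" }
                    sort keys %{ $blocks->{$block} }), "\n";
}
```

To run it on a real file, pass a filehandle opened on that file (or \*STDIN) to count_by_block instead of the in-memory sample.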
Kim
On Jan 17, 2005, at 12:05 PM, Marco Takita wrote:
Thanks Andrew for your input!
But the script still gives me the result for the total number of times
they appear in the text. What I need now is to get the results for
individual blocks, something like this:
input file:
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
Sequence Contig3773
Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185
output:
Contig 3772
CR05-C1-102 6 CR05-C1-103 1
Contig 3773
CR05-C1-102 3 CR05-C1-103 1
I believe that it is not very complicated to do that, but I'm just
not able to do it by myself...
Marco Takita
On Jan 17, 2005, at 5:34 PM, Andrew Mace wrote:
Why not something like:
use strict;
use warnings;

my %sequences = ();
my $seq;
while (<>) {
    if (m/^Sequence\s+(\S+)/) {
        # New block: remember its name and start both counters at zero.
        $seq = $1;
        $sequences{$seq} = [0, 0];    # [count of 102, count of 103]
    } elsif (defined $seq && m/CR05-C1-10(\d)/) {
        # Skip any lines that come before the first "Sequence" line.
        if ($1 == 2) {
            $sequences{$seq}->[0]++;
        } elsif ($1 == 3) {
            $sequences{$seq}->[1]++;
        }
    }
}
my $total_102 = 0;
my $total_103 = 0;
# Sort the keys so the blocks print in a stable order.
for (sort keys %sequences) {
    print $_, ": 102 = ", $sequences{$_}->[0], "; 103 = ",
        $sequences{$_}->[1], "\n";
    $total_102 += $sequences{$_}->[0];
    $total_103 += $sequences{$_}->[1];
}
print "Total 102 = ", $total_102, "\n";
print "Total 103 = ", $total_103, "\n";
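To get the two-lines-per-block layout Marco asked for out of the %sequences structure above, the report loop could be swapped for something like this sketch. The sub name format_blocks is made up for illustration, and the sort assumes block names contain a number, as "Contig3772" does in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: turns { name => [count_102, count_103] } into
# report text, one name line and one counts line per block.
sub format_blocks {
    my ($sequences) = @_;
    my $out = '';

    # Assumption: names look like "Contig3772". Sort on the digits,
    # falling back to a plain string comparison on ties.
    my $num = sub { $_[0] =~ /(\d+)/ ? $1 : 0 };
    my @names = sort { $num->($a) <=> $num->($b) or $a cmp $b }
                keys %$sequences;

    for my $name (@names) {
        my ($n102, $n103) = @{ $sequences->{$name} };
        $out .= "$name\n";
        $out .= "CR05-C1-102 $n102 CR05-C1-103 $n103\n";
    }
    return $out;
}

# Counts for the two sample blocks quoted in this thread.
my %sample_counts = (
    Contig3773 => [3, 1],
    Contig3772 => [6, 1],
);
print format_blocks(\%sample_counts);
```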
Andrew
On Jan 17, 2005, at 2:04 PM, Marco Takita wrote:
Hi guys, sorry for the question not directly related to Mac OS X, but
this is the OS I work with and I know that you guys are really
helpful.
I'm really new to Perl. Actually, I'm trying to write my very first
script. Let me try to explain what I need. I have a large text file
that is basically something like this:
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
Sequence Contig3773
Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185
Sequence Contig3774
and so on.
What I need is to count how many times either CR05-C1-102 or
CR05-C1-103 appears in the text, which I was able to do:
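#!/usr/bin/perl
use strict;
use warnings;

# Prefixes to count; quoted so Perl does not read the hyphens as minus signs.
my @text  = ('CR05-C1-102', 'CR05-C1-103');
my $score = 0;    # occurrences of CR05-C1-102
my $res   = 0;    # occurrences of CR05-C1-103
while (<>) {
    chomp;
    foreach my $wd (split) {
        # \Q...\E matches each prefix literally.
        if ($wd =~ /\Q$text[0]\E/) {
            $score++;
        }
        if ($wd =~ /\Q$text[1]\E/) {
            $res++;
        }
    }
}
print " CR05-C1-102 $score CR05-C1-103 $res \n\n";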
My problem is that I cannot do that for individual blocks like:
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
I was not able to isolate this block from the rest of the text.
Any idea how to do that?
Thanks a lot
Dr. Marco Aurélio Takita, Ph.D.
Centro APTA Citros Sylvio Moreira
Rodovia Anhanguera Km 158
Caixa Postal 04
13490-970 Cordeirópolis - SP, BRAZIL
Tel.: 55-19-35461399