Kim Helliwell
Mon, 17 Jan 2005 14:34:21 -0800
I don't have time to work out the details, but if I were faced with this problem, I'd use a hash of hashes to store the blocks, with the outer key set to the block names, and the inner keys set to the CR05--- whatever.
Use a regular expression to look for the string "Sequence" followed by the block name, which you store in a scalar while you count. Initialize an anonymous hash to hold the counts, keyed by whatever strings you need to search for, and increment the counts as you scan. When you hit the next occurrence of "Sequence", store the anonymous hash as the value in the main hash (under the saved block name) and create a new anonymous hash.
If you don't know about hashes of hashes and anonymous hashes, read (and study) chapter 4 of the Camel book.
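A minimal sketch of the hash-of-hashes idea described above, not a tested solution for the real file. It assumes block headers look like "Sequence <name>" and that the strings to count look like CR05-C1-102; the sub name count_by_block and the in-memory sample are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a hash of hashes: $counts->{block name}{prefix} = number of hits.
# Assumption: headers look like "Sequence <name>" and the strings to count
# look like CR05-C1-<digits>.
sub count_by_block {
    my ($fh) = @_;
    my (%counts, $block);
    while (my $line = <$fh>) {
        if ($line =~ /^\s*Sequence\s+(\S+)/) {
            $block = $1;
            $counts{$block} = {};    # fresh anonymous hash for this block
        }
        next unless defined $block;  # ignore anything before the first header
        # Count every match on the line, not just the first one.
        $counts{$block}{$1}++ while $line =~ /\b(CR05-C1-\d+)/g;
    }
    return \%counts;
}

# Hypothetical sample, shortened from the data in the thread.
my $sample = <<'END';
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Sequence Contig3773
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
END

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $counts = count_by_block($fh);
close $fh;

for my $name (sort keys %$counts) {
    my $inner = $counts->{$name};
    print "$name\n";
    print join(' ', map { "$_ $inner->{$_}" } sort keys %$inner), "\n";
}
```

For a real file, open it by name (or pass \*STDIN) in place of the in-memory sample. Because the inner keys come from the regex capture, the sub does not need to know in advance which prefixes exist.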
Kim
On Jan 17, 2005, at 12:05 PM, Marco Takita wrote:
Thanks Andrew for your input!
But the script still gives me the result for the total number of times they appear in the text. What I need now is to get the results for individual blocks, something like this:
input file:
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
Sequence Contig3773
Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185
output:
Contig 3772
CR05-C1-102 6 CR05-C1-103 1
Contig 3773
CR05-C1-102 3 CR05-C1-103 1
I believe it is not very complicated to do, but I'm just not able to do it by myself...
Marco Takita
On Jan 17, 2005, at 5:34 PM, Andrew Mace wrote:
Why not something like:
my %sequences = ();
my $seq;

while (<>) {
    if ($_ =~ m/^Sequence ([^\n]+)$/) {
        $seq = $1;
        $sequences{$1} = [0, 0];
    }
    elsif ($_ =~ m/CR05-C1-10(\d)/) {
        if ($1 == 2) {
            $sequences{$seq}->[0]++;
        }
        elsif ($1 == 3) {
            $sequences{$seq}->[1]++;
        }
    }
}

my $total_102 = 0;
my $total_103 = 0;
for (keys %sequences) {
    print $_, ": 102 = ", $sequences{$_}->[0], "; 103 = ", $sequences{$_}->[1], "\n";
    $total_102 += $sequences{$_}->[0];
    $total_103 += $sequences{$_}->[1];
}

print "Total 102 = ", $total_102, "\n";
print "Total 103 = ", $total_103, "\n";
Andrew
On Jan 17, 2005, at 2:04 PM, Marco Takita wrote:
Hi guys, sorry for the question not directly related to macosx but this is the OS I work with and I know that you guys are really helpful.
I'm really new to perl. Actually I'm trying to write my very first script. Let me try to explain what I need. I have a large text file that is basically something like this:
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
Sequence Contig3773
Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185
Sequence Contig3774
and so on.
What I need is to count how many times either CR05-C1-102 or CR05-C1-103 appears in the text, which I was able to do:
#!/usr/bin/perl
# Count how often each prefix appears in the whole input.
# The prefixes are quoted so Perl treats them as strings, not as subtraction.
@text = ('CR05-C1-102', 'CR05-C1-103');
while (<>) {
    chomp;
    foreach $wd (split) {
        if ($wd =~ /\Q$text[0]\E/) { $score++; }
        if ($wd =~ /\Q$text[1]\E/) { $res++; }
    }
}
print " CR05-C1-102 $score CR05-C1-103 $res \n\n";
My problem is that I cannot do that for individual blocks like:
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
I was not able to isolate this block from the rest of the text.
Any idea how to do that?
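One possible way to isolate the blocks, sketched here under assumptions rather than offered as the definitive answer: set Perl's input record separator $/ to "Sequence " so that each read returns one whole block. This assumes the text "Sequence " appears only in block headers; the sub name split_blocks and the in-memory sample are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of [name, body] pairs, one per block.
# Assumption: "Sequence " occurs only at the start of a block header.
sub split_blocks {
    my ($fh) = @_;
    local $/ = "Sequence ";             # each read stops after the next marker
    my @blocks;
    while (my $record = <$fh>) {
        chomp $record;                  # removes the trailing "Sequence " marker
        next unless $record =~ /\S/;    # skip the empty chunk before the first marker
        my ($name, $body) = split ' ', $record, 2;
        push @blocks, [ $name, defined $body ? $body : '' ];
    }
    return @blocks;
}

# Hypothetical sample taken from the data in this message.
my $sample = <<'END';
Sequence Contig3773
Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185
Sequence Contig3774
END

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
for my $blk (split_blocks($fh)) {
    my ($name, $body) = @$blk;
    # "= () =" forces list context so we get the number of matches.
    my $n102 = () = $body =~ /CR05-C1-102/g;
    my $n103 = () = $body =~ /CR05-C1-103/g;
    print "$name\nCR05-C1-102 $n102 CR05-C1-103 $n103\n";
}
close $fh;
```

Once each block is a single string, the whole-file counting logic can be applied to one block at a time.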
Thanks a lot
Dr. Marco Aurélio Takita, Ph.D. Centro APTA Citros Sylvio Moreira Rodovia Anhanguera Km 158 Caixa Postal 04 13490-970 Cordeirópolis - SP, BRAZIL Tel.: 55-19-35461399