Re: perl file parsing

Rob Dixon Thu, 23 Oct 2008 18:07:02 -0700

minky arora wrote:
> 
> I have a file of the following form
> 
> >FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_
> 
> GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
> 
> TCGACTAGCGTCGTG
> 
> >FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_
> 
> AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
> 
> 
> 
> 
> 
> It has over 5000 such blocks, each starting with ">". I need to search for a
> given pattern (String of characters) in the second line of each block and
> then print the block header (>FFM50HR02F5QTS). I only need to parse the
> first 500 blocks of each file. Of these 500 blocks, I then need to output
> the number of times the pattern has occurred. My code is below. I didn't
> think I had missed anything till I manually went into each file to compare
> the results, which don't match. Can someone point me to what's going wrong
> here?
> 
> 
> 
> 
> 
> 
> 
> #!/usr/bin/perl
> 
> $file_to_parse = "/home/myfile";
> $pattern = "CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT";
> #$pattern = "abc";
> 
> $max_blocks = 500;
> 
> # open the data file
> open (DAT, "$file_to_parse") || die ("Cannot open file: $file_to_parse");
> $match_count = 0;
> $block_count = 0;
> $block = "";
> while (<DAT>){
> 
>  chomp (); #remove newline characters
> 
>  if ($_ =~ /^>/ && $. > 0){ #beginning of the next block reached
> 
>   #look for matches in the current block
> 
>   if ($block_count <= $max_blocks){ # check not more than $max_blocks
> 
>    $num_matches = () = $block =~ /$pattern/g; #how many matches in this block
>    $match_count += $num_matches; #increase global match coutner
> 
>    $block =~ /^(>.+?)\s/g; #get block ID, e.g. >FIFKRKM06HCSVV
>    $block_id = $1;
> 
>    if ($num_matches > 0){ #output information
>     print "Block ID: $block_id\nBlock #: $block_count\nNumber of matches in this block: $num_matches\n\n";
>    }
> 
>   }
> 
>   $block = ""; #empty block holder variable
>   $block_count++; #increase block count
> 
> 
>  }
> 
>  #build the block, concatenate lines
>  $block .= $_;
> 
> }
> close DAT;
> 
> print "Max number of blocks to search: $max_blocks\n";
> print "Number of blocks found in this file: $block_count\n";
> print "Total matches in $max_blocks blocks: $match_count\n\n";
> 
> # exit
> exit;


I'm afraid there is too much wrong with your program for me to try to rescue it.
Instead, I hope I can make some suggestions and point a few things out.

First of all, /always/

  use strict;
  use warnings;

at the start of every program, especially one that you are asking for help with.
You will then have to declare every variable using 'my', and it will save you
from a lot of simple mistakes.
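
As a small illustration of the kind of mistake this catches, here is a minimal
sketch. The variable names are made up for the example, and the string eval is
only there so that the compile-time error can be shown without stopping the
script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Under 'use strict', a misspelled variable name is a compile-time error
# instead of a silently created global. The names below are hypothetical.
my $ok = eval q{
    use strict;
    my $match_count = 0;
    $match_cuont++;    # typo: this name was never declared with 'my'
    1;
};

# If the eval failed, $@ holds the compiler's error message
my $strict_error = $ok ? '' : $@;
print "strict caught: $strict_error" if $strict_error;
```

Without 'use strict' the misspelled line would quietly increment a brand new
variable and the real counter would stay at zero.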

Next, it is a bad idea to build /anything/ without testing it part-way through
construction. If I were making a motor car from a collection of parts, no matter how
carefully I had followed the manual, I would be amazed if I could simply get
into the seat, turn the key, and drive down the road. But that is what you have
done with your program, and it is what many less experienced programmers expect to
do.

Instead, you should write an incremental series of programs, with targets
something like this:

1 - Open your file, and make sure that you can read and print each line

2 - Print out just the block IDs in the file

3 - Accumulate the block data and print that out with its block ID

4 - Search for and count the substring within each block, and show those results
    too

After that, but not before, you will have something approaching a working 
solution.
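
To give an idea of how small each stage can be, here is a rough sketch of
step 2 only. The helper name block_ids is my own invention, and the sample
data is copied from the question and read through an in-memory filehandle so
that the sketch runs without the real file.

```perl
use strict;
use warnings;

# Step 2: collect just the block IDs.
# block_ids() is a hypothetical helper, not part of the original program.
sub block_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        chomp $line;
        # A header line starts with '>'; the ID runs up to the first space
        if ($line =~ /^(>\S+)/) {
            push @ids, $1;
        }
    }
    return @ids;
}

# Sample data taken from the question
my $sample = <<'END';
>FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_
GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
TCGACTAGCGTCGTG
>FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_
AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
END

open my $fh, '<', \$sample or die "Cannot open sample data: $!";
my @ids = block_ids($fh);
close $fh;

print "$_\n" for @ids;
```

Once the printed IDs agree with what you see in the file, move on to step 3.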

I won't say much more except that you need to be careful about what is in
$block at the moment you search it. A block is only examined when the header of
the /next/ block is read, so the first header triggers a search of an empty
block (and leaves $block_id undefined), and the data from the last block in the
file is just thrown away.
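
One possible way to keep each ID together with its own data, and to avoid
losing the last block, is sketched below. This is only an outline under a few
assumptions: count_matches is a hypothetical helper name, the pattern is
treated as a literal string (hence \Q...\E), and wrapped sequence lines are
joined before searching. The sample pattern was chosen so that it spans the
line wrap in the first block of the data from the question.

```perl
use strict;
use warnings;

# count_matches() is a hypothetical helper. It returns a list of
# [block_id, match_count] pairs for at most $max_blocks blocks.
sub count_matches {
    my ($fh, $pattern, $max_blocks) = @_;
    my (@results, $id, $data);

    # Record the block collected so far, if there is one, then forget it
    my $finish = sub {
        return unless defined $id;
        my $n = () = $data =~ /\Q$pattern\E/g;
        push @results, [$id, $n];
        undef $id;
    };

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^(>\S+)/) {
            my $next_id = $1;
            $finish->();                      # previous block is now complete
            last if @results >= $max_blocks;
            ($id, $data) = ($next_id, '');    # start a new block
        }
        elsif (defined $id) {
            $data .= $line;                   # join wrapped sequence lines
        }
    }
    $finish->();    # the last block has no following header to trigger it
    return @results;
}

# Sample data taken from the question
my $sample = <<'END';
>FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_
GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
TCGACTAGCGTCGTG
>FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_
AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
END

my $pattern = "CCCCGTTCGACTAG";    # spans the line break in the first block

open my $fh, '<', \$sample or die "Cannot open sample data: $!";
my @results = count_matches($fh, $pattern, 500);
close $fh;

printf "%s: %d\n", @$_ for @results;
```

Keeping the ID and the sequence in separate variables also means the header
text is never searched by accident.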

One more thing. $. is always greater than zero after a successful file read, so
the test $. > 0 in your 'if' condition is redundant.
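
If you want to see this for yourself, a tiny sketch like the following will
do. The two-line text is made up and is read through an in-memory filehandle.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";
open my $fh, '<', \$text or die "Cannot open in-memory data: $!";

my @line_numbers;
while (my $line = <$fh>) {
    push @line_numbers, $.;    # $. holds the number of the line just read
}
close $fh;

print "Line numbers seen: @line_numbers\n";
```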

HTH,

Rob

