On Thu, Jun 2, 2011 at 1:28 PM, venkates <venka...@nt.ntnu.no> wrote:

> On 6/2/2011 12:46 PM, John SJ Anderson wrote:
>
>> On Thu, Jun 2, 2011 at 06:41, venkates<venka...@nt.ntnu.no>  wrote:
>>
>>> Hi,
>>>
>>> I want to parse a file with contents that look as follows:
>>>
>> [ snip ]
>>
>> Have you considered using this module? ->
>> <http://search.cpan.org/dist/BioPerl/Bio/SeqIO/kegg.pm>
>>
>> Alternatively, I think somebody on the BioPerl mailing list was
>> working on another KEGG parser...
>>
>> chrs,
>> j.
>>
> I am doing this as an exercise to learn parsing techniques, so some
> guidance would be appreciated.
>
> Aravind
>
This is a simple and ugly way of parsing your file:

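use strict;
use warnings;
use Carp;
use Data::Dumper;

my $set = parse("ko");
print Dumper $set;

sub parse {
 my $keggFile = shift;
 my $keggHash = {};

 # Records are keyed by their position in the file, starting at 1.
 my $counter = 1;

 # 'or', not '||': '||' binds to $keggFile, so croak would never run.
 open my $fh, '<', $keggFile or croak("Cannot open file '$keggFile': $!");
 while ( my $line = <$fh> ) {
  chomp $line;

  # '///' ends a record, so move on to the next key.
  if ( $line =~ m!///! ) {
   $counter++;
   next;
  }

  if ( $line =~ /^ENTRY\s+(\S+)/ ) {
   $keggHash->{$counter} = { 'ENTRY' => $1 };
  }
  elsif ( $line =~ /^NAME\s+(.*)$/ ) {
   # Names are comma-separated; drop the whitespace after each comma.
   my @names = split /,\s*/, $1;
   push @{ $keggHash->{$counter}{'NAME'} }, @names;
  }
 }
 close $fh;

 # Return the structure so the caller gets the data, not print's result.
 return $keggHash;
}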

The output being:

$VAR1 = {
          '1' => {
                   'NAME' => [
                               'E1.1.1.1',
                               'adh'
                             ],
                   'ENTRY' => 'K00001'
                 },
          '3' => {
                   'NAME' => [
                               'U18snoRNA',
                               'snR18'
                             ],
                   'ENTRY' => 'K14866'
                 },
          '2' => {
                   'NAME' => [
                               'U14snoRNA',
                               'snR128'
                             ],
                   'ENTRY' => 'K14865'
                 }
        };

Which, to me, looks roughly like what you are after.
The main thing I did was read the file one line at a time, so that an
unexpectedly large file does not cause memory issues on your machine (though
in the end the structure you are building will cause problems of its own
when handling a large file).
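
Note that Data::Dumper prints the keys in hash order (1, 3, 2 above), not in
file order. A minimal sketch of walking the structure in record order
follows; the sub name summary_lines is made up for illustration, and the data
is copied from the dump above.

```perl
use strict;
use warnings;

# Same shape as the dump above; the literal data is copied from that output.
my $set = {
    1 => { ENTRY => 'K00001', NAME => [ 'E1.1.1.1',  'adh'    ] },
    2 => { ENTRY => 'K14865', NAME => [ 'U14snoRNA', 'snR128' ] },
    3 => { ENTRY => 'K14866', NAME => [ 'U18snoRNA', 'snR18'  ] },
};

# Hypothetical helper: one summary line per record, in numeric key order.
sub summary_lines {
    my ($records) = @_;
    my @lines;

    # Sort numerically; a plain sort would put '10' before '2'.
    for my $n ( sort { $a <=> $b } keys %{$records} ) {
        my $rec = $records->{$n};
        push @lines, sprintf '%d: %s => %s',
            $n, $rec->{ENTRY}, join( ', ', @{ $rec->{NAME} } );
    }
    return @lines;
}

print "$_\n" for summary_lines($set);
```

The numeric sort matters once there are ten or more records, since the keys
are stored as strings.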

You already dealt with the ENTRY bit, so I'll leave that as it is; I changed
the regex slightly, but nothing spectacular there.
The NAME bit is simple: I capture everything after the keyword, remove the
whitespace after each comma, split the result into an array and push that
array onto the hash. The next step is up to you ;-)
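
As one possible direction for that next step, here is a rough sketch of a
more general approach: treat any upper-case keyword at the start of a line as
a field name, and treat indented lines as continuations of the previous
field. The sub name parse_record and the sample lines are made up for
illustration; since the original file was snipped, the layout assumed here
may not match yours exactly.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record (an array ref of chomped lines)
# into a hash of field name => array ref of values.
sub parse_record {
    my ($lines) = @_;
    my ( %record, $field );

    for my $line ( @{$lines} ) {
        if ( $line =~ /^([A-Z_]+)\s+(.*\S)/ ) {
            # A keyword in column one starts a new field.
            $field = $1;
            push @{ $record{$field} }, $2;
        }
        elsif ( defined $field && $line =~ /^\s+(.*\S)/ ) {
            # An indented line continues the most recent field.
            push @{ $record{$field} }, $1;
        }
    }
    return \%record;
}

# Made-up sample record, for illustration only.
my @sample = (
    'ENTRY       K00001',
    'NAME        E1.1.1.1, adh',
    'PATHWAY     ko00010  First pathway',
    '            ko00071  Second pathway',
);

my $record = parse_record( \@sample );
for my $name ( sort keys %{$record} ) {
    print "$name: ", join( ' | ', @{ $record->{$name} } ), "\n";
}
```

Each field holds an array, so a multi-line field simply ends up with one
element per line; splitting NAME on commas, as in the code above, would then
be a second pass over that one field.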

I hope it helps you a bit, regards,

Rob
