Re: Parsing file

Rob Coops Thu, 02 Jun 2011 08:07:06 -0700

On Thu, Jun 2, 2011 at 4:41 PM, venkates <venka...@nt.ntnu.no> wrote:


> On 6/2/2011 2:44 PM, Rob Coops wrote:
>
>> On Thu, Jun 2, 2011 at 1:28 PM, venkates<venka...@nt.ntnu.no>  wrote:
>>
>>  On 6/2/2011 12:46 PM, John SJ Anderson wrote:
>>>
>>>  On Thu, Jun 2, 2011 at 06:41, venkates<venka...@nt.ntnu.no>   wrote:
>>>>
>>>>  Hi,
>>>>>
>>>>> I want to parse a file with contents that looks as follows:
>>>>>
>>>>>  [ snip ]
>>>>
>>>> Have you considered using this module? ->
>>>> <http://search.cpan.org/dist/BioPerl/Bio/SeqIO/kegg.pm>
>>>>
>>>> Alternatively, I think somebody on the BioPerl mailing list was
>>>> working on another KEGG parser...
>>>>
>>>> chrs,
>>>> j.
>>>>
>>> I am doing this as an exercise to learn parsing techniques, so guidance
>>> and help are needed.
>>>
>>> Aravind
>>>
>>>  This is a simple and ugly way of parsing your file:
>>
>> use strict;
>> use warnings;
>> use Carp;
>> use Data::Dumper;
>>
>> my $set = parse("ko");
>>
>> sub parse {
>>  my $keggFile = shift;
>>  my $keggHash;
>>
>>  my $counter = 1;
>>
>>  # Use low-precedence 'or': with '||' the croak binds to $keggFile, not to open.
>>  open my $fh, '<', $keggFile or croak("Cannot open file '$keggFile': $!");
>>  while (<$fh>  ) {
>>   chomp;
>>   if ( $_ =~ m!///! ) {
>>    $counter++;
>>    next;
>>   }
>>
>>   if ( $_ =~ /^ENTRY\s+(.+?)\s/sm ) {
>>    ${$keggHash}{$counter} = { 'ENTRY' => $1 };
>>   }
>>
> When I try a similar thing for the DEFINITION record, instead of being
> added to the current hash alongside ENTRY and NAME, the DEFINITION record
> replaces the contents of the hash?
>
> $VAR1 = {
>          '4' => {
>                   'DEFINITION' => 'U18 small nucleolar RNA'
>                 },
>          '1' => {
>                   'DEFINITION' => 'alcohol dehydrogenase [EC:1.1.1.1]'
>                 },
>          '3' => {
>                   'DEFINITION' => 'U14 small nucleolar RNA'
>                 },
>          '2' => {
>                   'DEFINITION' => 'alcohol dehydrogenase (NADP+) [EC:1.1.1.2]'
>                 },
>          '5' => {
>                   'DEFINITION' => 'U24 small nucleolar RNA'
>                 }
>        };
>
> code: in addition to what you had suggested -
> if($_ =~ /^DEFINITION\s{2}(.+)?/){
>               ${$keggHash}{$counter} = {'DEFINITION' => $1};
>
>           }
>
>>   if ( $_ =~ /^NAME\s+(.*)$/sm ) {
>>    my $temp = $1;
>>    $temp =~ s/,\s/,/g;
>>    my @names = split /,/, $temp;
>>    push @{${$keggHash}{$counter}{'NAME'}}, @names;
>>   }
>>  }
>>  close $fh;
>>  print Dumper $keggHash;
>> }
>>
>> The output being:
>>
>> $VAR1 = {
>>           '1' =>  {
>>                    'NAME' =>  [
>>                                'E1.1.1.1',
>>                                'adh'
>>                              ],
>>                    'ENTRY' =>  'K00001'
>>                  },
>>           '3' =>  {
>>                    'NAME' =>  [
>>                                'U18snoRNA',
>>                                'snR18'
>>                              ],
>>                    'ENTRY' =>  'K14866'
>>                  },
>>           '2' =>  {
>>                    'NAME' =>  [
>>                                'U14snoRNA',
>>                                'snR128'
>>                              ],
>>                    'ENTRY' =>  'K14865'
>>                  }
>>         };
>>
>> Which to me looks sort of like what you are looking for.
>> The main thing I did was read the file one line at a time to prevent an
>> unexpectedly large file from causing memory issues on your machine (in the
>> end, the structure that you are building will cause enough issues
>> when handling a large file).
>>
>> You already dealt with the Entry bit, so I'll leave that open, though I
>> slightly changed the regex; nothing spectacular there.
>> The Name bit is simple: I just pull out all of the names, remove the
>> spaces after the commas, split them into an array, feed the array to the
>> hash, and hop, time for the next step, which is up to you ;-)
>>
>> I hope it helps you a bit, regards,
>>
>> Rob
>>
>>
>
What you do:         ${$keggHash}{$counter} = {'DEFINITION' => $1};
Try the following:   ${$keggHash}{$counter}{'DEFINITION'} = $1;

To make things a little clearer look at the following example.

my %hash;
$hash{'Key 1'} = { 'Nested Key 1' => 'Value 1' };

What you do is say: $hash{'Key 1'} = { 'Nested Key 2' => 'Value 2' };
What I do is:       $hash{'Key 1'}{'Nested Key 2'} = 'Value 2';

In your script you will end up with the following:
$VAR1 = {
         'Key 1' => {
                  'Nested Key 2' => 'Value 2',
                },
};

Where mine will result in:
$VAR1 = {
         'Key 1' => {
                  'Nested Key 1' => 'Value 1',
                  'Nested Key 2' => 'Value 2',
                },
};

Not that much different, but you are basically overwriting the value (
{NAME=>[], ENTRY=>''} ) associated with your key ($counter) with {
'DEFINITION' => ''}. If you instead add a new key to the hash that is
associated with your main key ($counter), then you will get the result you
are looking for.
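
Putting that together, here is a rough sketch of how the whole loop could
look when every branch adds a key instead of assigning a new hash. It is
only an illustration, not a finished parser: the sub name parse_kegg and the
sample records are made up for this example (modelled on the output quoted
above), and it reads from a string so that it runs without your "ko" file.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample records, modelled on the output quoted in this thread.
my $sample = <<'END';
ENTRY       K00001                      KO
NAME        E1.1.1.1, adh
DEFINITION  alcohol dehydrogenase [EC:1.1.1.1]
///
ENTRY       K14866                      KO
NAME        U18snoRNA, snR18
DEFINITION  U18 small nucleolar RNA
///
END

# parse_kegg is an illustrative name, not part of any module.
sub parse_kegg {
    my ($text) = @_;
    my %kegg;
    my $counter = 1;

    # Read the string line by line through an in-memory filehandle.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    while ( my $line = <$fh> ) {
        chomp $line;

        # '///' ends a record, so move on to the next counter value.
        if ( $line =~ m{^///} ) { $counter++; next; }

        # Each branch adds ONE key to the record's hash; none of them
        # assigns a fresh { ... } to $kegg{$counter}.
        if ( $line =~ /^ENTRY\s+(\S+)/ ) {
            $kegg{$counter}{ENTRY} = $1;
        }
        elsif ( $line =~ /^NAME\s+(.*)$/ ) {
            my $names = $1;    # copy the capture before calling split
            push @{ $kegg{$counter}{NAME} }, split /,\s*/, $names;
        }
        elsif ( $line =~ /^DEFINITION\s+(.+)$/ ) {
            $kegg{$counter}{DEFINITION} = $1;
        }
    }
    close $fh;
    return \%kegg;
}

my $set = parse_kegg($sample);

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper($set);
```

Each record should now carry ENTRY, NAME and DEFINITION side by side. For
your real file you would open "ko" instead of the string, the rest stays
the same.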

Regards,

Rob
