Re: Parsing file

Rob Coops Thu, 02 Jun 2011 13:04:23 -0700

On Thu, Jun 2, 2011 at 8:32 PM, venkates <venka...@nt.ntnu.no> wrote:


> Hi,
>
> Thanks a lot for the help. I had one more question: how can I add different
> values from multiple lines to the same hash ref? For example, in the snippet data
>
>
> PATHWAY     ko00010  Glycolysis / Gluconeogenesis
>            ko00071  Fatty acid metabolism
>            ko00350  Tyrosine metabolism
>            ko00625  Chloroalkane and chloroalkene degradation
>            ko00626  Naphthalene degradation
>
> I want it to be stored in the following manner:
>
> '2' => {
>            'PATHWAY' => {
>                           'ko00010' => 'Glycolysis / Gluconeogenesis',
>                           'ko00071' => 'Fatty acid metabolism',
>                         },
> };
>
> Thanks,
>
> Aravind
>
>
> On 6/2/2011 5:06 PM, Rob Coops wrote:
>
>> On Thu, Jun 2, 2011 at 4:41 PM, venkates<venka...@nt.ntnu.no>  wrote:
>>
>>  On 6/2/2011 2:44 PM, Rob Coops wrote:
>>>
>>>  On Thu, Jun 2, 2011 at 1:28 PM, venkates<venka...@nt.ntnu.no>   wrote:
>>>>
>>>>  On 6/2/2011 12:46 PM, John SJ Anderson wrote:
>>>>
>>>>>  On Thu, Jun 2, 2011 at 06:41, venkates<venka...@nt.ntnu.no>    wrote:
>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>>> I want to parse a file with contents that looks as follows:
>>>>>>>
>>>>>>>  [ snip ]
>>>>>>>
>>>>>> Have you considered using this module? ->
>>>>>> <http://search.cpan.org/dist/BioPerl/Bio/SeqIO/kegg.pm>
>>>>>>
>>>>>> Alternatively, I think somebody on the BioPerl mailing list was
>>>>>> working on another KEGG parser...
>>>>>>
>>>>>> chrs,
>>>>>> j.
>>>>>>
>>>>>>  I am doing this as an exercise  to learn parsing techniques so
>>>>>> guidance
>>>>>>
>>>>>>  help needed.
>>>>>
>>>>> Aravind
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
>>>>> For additional commands, e-mail: beginners-h...@perl.org
>>>>> http://learn.perl.org/
>>>>>
>>>>>
>>>>>
>>>>>  This is a simple and ugly way of parsing your file:
>>>>>
>>>> use strict;
>>>> use warnings;
>>>> use Carp;
>>>> use Data::Dumper;
>>>>
>>>> my $set = parse("ko");
>>>>
>>>> sub parse {
>>>>  my $keggFile = shift;
>>>>  my $keggHash;
>>>>
>>>>  my $counter = 1;
>>>>
>>>>  open my $fh, '<', $keggFile or croak("Cannot open file '$keggFile': $!");
>>>>  while (<$fh>   ) {
>>>>   chomp;
>>>>   if ( $_ =~ m!///! ) {
>>>>    $counter++;
>>>>    next;
>>>>   }
>>>>
>>>>   if ( $_ =~ /^ENTRY\s+(.+?)\s/sm ) { ${$keggHash}{$counter} = { 'ENTRY' => $1 }; }
>>>>
>>>>  While trying a similar thing for DEFINITION record, instead of
>>> appending
>>> current hash with ENTRY and NAME, the DEFINITION record replaces the
>>> contents in the hash?
>>>
>>> $VAR1 = {
>>>          '4' =>  {
>>>                   'DEFINITION' =>  'U18 small nucleolar RNA'
>>>                 },
>>>          '1' =>  {
>>>                   'DEFINITION' =>  'alcohol dehydrogenase [EC:1.1.1.1]'
>>>                 },
>>>          '3' =>  {
>>>                   'DEFINITION' =>  'U14 small nucleolar RNA'
>>>                 },
>>>          '2' =>  {
>>>                   'DEFINITION' =>  'alcohol dehydrogenase (NADP+) [EC:1.1.1.2]'
>>>                 },
>>>          '5' =>  {
>>>                   'DEFINITION' =>  'U24 small nucleolar RNA'
>>>                 }
>>>        };
>>>
>>> code: in addition to what you had suggested -
>>> if($_ =~ /^DEFINITION\s{2}(.+)?/){
>>>               ${$keggHash}{$counter} = {'DEFINITION' =>  $1};
>>>
>>>           }
>>>
>>>    if ( $_ =~ /^NAME\s+(.*)$/sm ) {
>>>>    my $temp = $1;
>>>>    $temp =~ s/,\s/,/g;
>>>>    my @names = split /,/, $temp;
>>>>    push @{${$keggHash}{$counter}{'NAME'}}, @names;
>>>>   }
>>>>  }
>>>>  close $fh;
>>>>  print Dumper $keggHash;
>>>> }
>>>>
>>>> The output being:
>>>>
>>>> $VAR1 = {
>>>>           '1' =>   {
>>>>                    'NAME' =>   [
>>>>                                'E1.1.1.1',
>>>>                                'adh'
>>>>                              ],
>>>>                    'ENTRY' =>   'K00001'
>>>>                  },
>>>>           '3' =>   {
>>>>                    'NAME' =>   [
>>>>                                'U18snoRNA',
>>>>                                'snR18'
>>>>                              ],
>>>>                    'ENTRY' =>   'K14866'
>>>>                  },
>>>>           '2' =>   {
>>>>                    'NAME' =>   [
>>>>                                'U14snoRNA',
>>>>                                'snR128'
>>>>                              ],
>>>>                    'ENTRY' =>   'K14865'
>>>>                  }
>>>>         };
>>>>
>>>> Which to me looks sort of like what you are looking for.
>>>> The main thing I did was read the file one line at a time to prevent an
>>>> unexpectedly large file from causing memory issues on your machine (in
>>>> the end, the structure that you are building will cause enough issues
>>>> when handling a large file).
>>>>
>>>> You already dealt with the ENTRY bit, so I'll leave that open, though I
>>>> slightly changed the regex; nothing spectacular there.
>>>> The NAME bit is simple: I pull out all of the names, remove the space
>>>> after each comma, and split them into an array. Feed the array to the
>>>> hash and, hop, it is time for the next step, which is up to you ;-)
>>>>
>>>> I hope it helps you a bit, regards,
>>>>
>>>> Rob
>>>>
>>>>
>>>>  What you do: ${$keggHash}{$counter} = {'DEFINITION' =>  $1};
>> Try the following:   ${$keggHash}{$counter}{'DEFINITION'} = $1;
>>
>> To make things a little clearer look at the following example.
>>
>> my %hash;
>> $hash{'Key 1'} = { 'Nested Key 1' =>  'Value 1' };
>>
>> What you do is say: $hash{'Key 1'} = { 'Nested Key 2' =>  'Value 2' };
>> What I do is: $hash{'Key 1'}{'Nested Key 2'} = 'Value 2';
>>
>> In your script you will end up with the following:
>> $VAR1 = {
>>          'Key 1' =>  {
>>                   'Nested Key 2' =>  'Value 2',
>>                 },
>> };
>>
>> Where mine will result in:
>> $VAR1 = {
>>          'Key 1' =>  {
>>                   'Nested Key 1' =>  'Value 1',
>>                   'Nested Key 2' =>  'Value 2',
>>                 },
>> };
>>
>> Not that much different, but you are basically overwriting the value (
>> {NAME=>[], ENTRY=>''} ) associated with your key ($counter) with {
>> 'DEFINITION' =>  ''}. If you instead add a new key to the hash that is
>> associated with your main key ($counter), then you will get the result you
>> are looking for.
>>
>> Regards,
>>
>> Rob
>>
>>
>
>
In that case you need to do a few things. First, you need to recognise
where the PATHWAY segment begins, which is easy enough: you are already
doing that for the NAME, DEFINITION, etc. segments. You also need to
remember that you are now working on the PATHWAY segment (or, to be more
flexible, on any multiline segment). After that, you process the
continuation lines the same way you processed the first one.

So first of all, let's declare a $multiline variable before the while loop:
my $multiline = '';    # initialised so the eq test below never sees undef
while ( <$fh> ) {
 chomp;
 if ( $_ =~ m!///! ) {
  $counter++;
  $multiline = '';
  next;
 }

 # If you find the start of any other segment, empty the $multiline variable
 if ( $_ =~ /^\w+/ ) { $multiline = ''; }

 if ( $_ =~ /^PATHWAY\s+(\S+)\s+(.*)/ ) {
  # If we find the PATHWAY segment, set the $multiline variable to indicate this
  $multiline = 'PATHWAY';
  # Deal with the data found behind the PATHWAY keyword and end the
  # processing of this line.
  ${$keggHash}{$counter}{'PATHWAY'}{$1} = $2;
  next;
 }

 # Continuation lines start with whitespace; only store the pair if the
 # match succeeds, so a stale $1/$2 is never reused
 if ( $multiline eq 'PATHWAY' && $_ =~ /^\s+(\S+)\s+(.*)/ ) {
  ${$keggHash}{$counter}{'PATHWAY'}{$1} = $2;
 }

 # Now you can deal with any other lines below just like before
}

Of course, if you have other multiline segments, simply do the same, but
this time fill the $multiline variable with the name of that segment.
I have not tested this, so there might be a typo here or there, but the
principle is hopefully clear.
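
To make the "any multiline segment" idea a bit more concrete, here is a
rough, self-contained sketch. The sample data, the %is_multiline lookup
table and the parse_kegg name are all made up for illustration, and it
assumes that a segment name always starts in the first column while
continuation lines start with whitespace. It only stores the multiline
segments; ENTRY, NAME and friends would be handled as before.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample in the layout quoted above; '///' ends a record.
my $sample = <<'END';
ENTRY       K00001                      KO
NAME        E1.1.1.1, adh
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00071  Fatty acid metabolism
///
ENTRY       K00002                      KO
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
///
END

# Segments whose lines hold "key  value" pairs (an assumption; extend as needed).
my %is_multiline = map { $_ => 1 } qw(PATHWAY);

sub parse_kegg {
    my ($text) = @_;
    my %records;
    my $counter = 1;
    my $segment = '';    # name of the segment currently being read

    # Read the string through an in-memory filehandle, one line at a time.
    open my $fh, '<', \$text or die "Cannot read sample: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ m{^///} ) {    # end of record
            $counter++;
            $segment = '';
            next;
        }

        my $data;
        if ( $line =~ /^(\w+)\s+(.*)$/ ) {    # a new segment starts in column 1
            ( $segment, $data ) = ( $1, $2 );
        }
        elsif ( $line =~ /^\s+(\S.*)$/ ) {    # continuation of the current segment
            $data = $1;
        }
        else {
            next;                             # blank or unrecognised line
        }

        # Only the multiline "key  value" segments are stored in this sketch.
        next unless $is_multiline{$segment};
        if ( $data =~ /^(\S+)\s+(.*)$/ ) {
            $records{$counter}{$segment}{$1} = $2;
        }
    }
    close $fh;
    return \%records;
}

my $set = parse_kegg($sample);
print Dumper($set);
```

Because the segment name is taken from the line itself, supporting
another multiline segment should only need one more entry in
%is_multiline, provided its lines follow the same "key  value" layout.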

Regards,

Rob
