On 09/06/2011 09:48, venkates wrote:
> Hi,
>
> data snippet:
>
> ENTRY       K00002                      KO
> NAME        E1.1.1.2, adh
> DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
> PATHWAY     ko00010  Glycolysis / Gluconeogenesis
>             ko00561  Glycerolipid metabolism
>             ko00930  Caprolactam degradation
> CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis
>             [PATH:ko00010]
>             Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
>             Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam
>             degradation [PATH:ko00930]
> DBLINKS     RN: R00746 R01041 R05231
>             COG: COG0656
>             GO: 0008106
> GENES       HSA: 10327(AKR1A1)
>             PTR: 741418(AKR1A1)
>             PON: 100173796(AKR1A1)
>             MCC: 693380(AKR1A1)
>             MMU: 58810(Akr1a4)
>             RNO: 78959(Akr1a1)
>             CFA: 610537
> ///
> ENTRY       K00730                      KO
> NAME        OST4
> DEFINITION  oligosaccharyl transferase complex subunit OST4
> PATHWAY     ko00510  N-Glycan biosynthesis
>             ko00513  Various types of N-glycan biosynthesis
>             ko04141  Protein processing in endoplasmic reticulum
> MODULE      M00072  Oligosaccharyltransferase
> CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan
>             biosynthesis [PATH:ko00510]
>             Metabolism; Glycan Biosynthesis and Metabolism; Various types of
>             N-glycan biosynthesis [PATH:ko00513]
>             Genetic Information Processing; Folding, Sorting and Degradation;
>             Protein processing in endoplasmic reticulum [PATH:ko04141]
> DBLINKS     GO: 0008250
> GENES       SCE: YDL232W(OST4)
>             AGO: AGOS_ABL170C
>             KLA: KLLA0A01287g
>             VPO: Kpol_1054p35
>             SSL: SS1G_13465
> REFERENCE   PMID:15001703
>   AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
>   TITLE     Structural basis for the function of a minimembrane protein
>             subunit of yeast oligosaccharyltransferase.
>   JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
> ///
>
> I need to retrieve all the gene entries to add it to a hash ref. My code
> does that in the first record but in the second case it also pulls out
> the REFERENCE information. I have provided the code below. If some one
> could tell me where exactly I am going wrong (is it in the regex? or
> otherwise) I would be glad!!
>
> code :
>
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
>
>
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
> sub parse {
>
>     my $kegg_file_path = shift;
>     my $keggData; # Hash ref
>
>     open my $fh, '<', $kegg_file_path or croak("Cannot open file '$kegg_file_path': $!");
>     local $/ = "\n///\n";
>     while (<$fh>){
>         chomp;
>         my $record = $_;
>         $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
>         my $entries = $1;
>         if ($record =~ m/^GENES\s{7}(.+)$/xms){
>             my $gene = $1;
>             ${$keggData}{$entries}{'GENE'} = $gene;
>             my @genes = split ('\s{13}', $gene);
>             foreach my $gene_element (@genes){
>                 my $taxon_label = substr($gene_element, 0, 3);
>                 my $gene_label = substr($gene_element, 5);
>                 my @gene_label_array = split '\s', $gene_label;
>                 push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
>             }
>         }
>
>     }
>     print Dumper($keggData);
>     close $fh;
> }

I would prefer to read the file a line at a time. The code below seems
to do what you want.

HTH,

Rob


use strict;
use warnings;

use Data::Dumper;

my $kegg_file = '/home/venkates/workspace/KEGG_Parser/data/ko';

my $fh;
unless (open $fh, '<', $kegg_file) {
  warn "Failed to open file: $!. Defaulting to DATA.";
  $fh = *DATA;
}

parse($fh);

sub parse {

  my $kegg_file_handle = shift;
  my $keggData;

  my $entry;   # ID of the record being read, e.g. K00002
  my $key;     # field the current line belongs to, e.g. GENES

  while (<$kegg_file_handle>) {

    next unless /\S/;

    # '///' ends a record, so forget the current entry and field
    if (m|///|) {
      undef $entry;
      undef $key;
      next;
    }

    chomp;

    # A line that starts in column 1 names a new field; a continuation
    # line starts with whitespace and keeps the previous key
    next unless m|^(.{0,11}?)\s+(.+)|;
    $key = $1 if $1;
    my $val = $2;

    if ($key eq 'ENTRY') {
      ($entry) = $val =~ /(\S+)/;
    }
    elsif ($key eq 'GENES') {
      die "No current entry" unless $entry;
      my ($taxon_label, @gene_label_array) = split /:?\s+/, $val;
      push @{$keggData->{$entry}{$key}{$taxon_label}}, @gene_label_array;
    }
  }

  print Dumper($keggData);
}

__DATA__
ENTRY       K00002                      KO
NAME        E1.1.1.2, adh
DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00561  Glycerolipid metabolism
            ko00930  Caprolactam degradation
CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis
            [PATH:ko00010]
            Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
            Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam
            degradation [PATH:ko00930]
DBLINKS     RN: R00746 R01041 R05231
            COG: COG0656
            GO: 0008106
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
            PON: 100173796(AKR1A1)
            MCC: 693380(AKR1A1)
            MMU: 58810(Akr1a4)
            RNO: 78959(Akr1a1)
            CFA: 610537
///
ENTRY       K00730                      KO
NAME        OST4
DEFINITION  oligosaccharyl transferase complex subunit OST4
PATHWAY     ko00510  N-Glycan biosynthesis
            ko00513  Various types of N-glycan biosynthesis
            ko04141  Protein processing in endoplasmic reticulum
MODULE      M00072  Oligosaccharyltransferase
CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan
            biosynthesis [PATH:ko00510]
            Metabolism; Glycan Biosynthesis and Metabolism; Various types of
            N-glycan biosynthesis [PATH:ko00513]
            Genetic Information Processing; Folding, Sorting and Degradation;
            Protein processing in endoplasmic reticulum [PATH:ko04141]
DBLINKS     GO: 0008250
GENES       SCE: YDL232W(OST4)
            AGO: AGOS_ABL170C
            KLA: KLLA0A01287g
            VPO: Kpol_1054p35
            SSL: SS1G_13465
REFERENCE   PMID:15001703
  AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
  TITLE     Structural basis for the function of a minimembrane protein
            subunit of yeast oligosaccharyltransferase.
  JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
///
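
As for where the original regex goes wrong: with the /s modifier, '.' also matches newlines, so the greedy (.+)$ in m/^GENES\s{7}(.+)$/xms runs to the end of the record. In the first record GENES is the last field, so nothing extra is captured; in the second, the REFERENCE block follows and is swept in as well. If you want to keep reading a record at a time, one option is to capture only the GENES line and its indented continuation lines. The following is a minimal sketch under that assumption, not a tested replacement for your parser; genes_from_record is a hypothetical helper name, and the sample record is a trimmed copy of K00730 from your data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): take one KEGG record as a
# string and return a hash ref of taxon code => [gene labels].
sub genes_from_record {
    my ($record) = @_;
    my %genes;

    # No /s here, so '.' stops at each newline. The group after the first
    # line accepts only continuation lines, i.e. lines that begin with
    # spaces or tabs, so the match ends at the next field that starts in
    # column 1 (REFERENCE in the second record) or at the end of the record.
    if ($record =~ /^GENES\s+(.*(?:\n[ \t]+.*)*)/m) {
        for my $line (split /\n/, $1) {
            $line =~ s/^\s+//;    # drop the continuation indent
            my ($taxon, @labels) = split /:?\s+/, $line;
            push @{ $genes{$taxon} }, @labels;
        }
    }
    return \%genes;
}

# Trimmed sample record, assuming the 12-column key layout shown above
my $record = join "\n",
    'ENTRY       K00730                      KO',
    'GENES       SCE: YDL232W(OST4)',
    '            AGO: AGOS_ABL170C',
    'REFERENCE   PMID:15001703',
    '  AUTHORS   Zubkov S, Lennarz WJ, Mohanty S';

my $genes = genes_from_record($record);
print join(',', sort keys %$genes), "\n";    # prints "AGO,SCE"
```

The helper takes an already chomped record, so it should slot into a loop that sets $/ to "\n///\n" as your code does. Reading line by line avoids the multi-line regex altogether, which is why I prefer it.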