Mike, many thanks for the help, particularly your comments which were as valuable as your code!
I revised the code to the following: #!/usr/bin/perl # parse_IMG_gene_info4.pl use strict; use warnings; for my $file( @ARGV ) { open( my $IN, "<", $file ) or die "Failed to open: $!\n"; my( %cog_cat, %cog_id, @cogs, $oid, $locus, $source, $cluster_info ); while( my $line = <$IN> ) { chomp $line; if( $line=~ /COG_category/ ) { ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line; push @{ $cog_cat{ $locus } }, $cluster_info if( $cluster_info ); } elsif ( $line=~ /COG\d+/ ) { ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line; push @{ $cog_id{ $locus } }, $source if( $source ); push @{ $cog_id{ $locus } }, $cluster_info if( $cluster_info ); } } close $IN; #use Data::Dumper; #print Dumper \%cog_cat, \%cog_id; print "Locus\tCOG_category\tCOG_category\tCOG_ID\tCluster_information\tCOG_ID\tCluster_information\n\n"; for my $test( sort keys %cog_cat ) { print $test, map {"\t$_"} @{ $cog_cat{ $test } }; #\t$cog_id{ $test } if( scalar @{ $cog_cat{ $test } } > 1 ) { print map {"\t$_"} @{ $cog_id{ $test } }, "\n"; } else { print "\t", map {"\t$_"} @{ $cog_id{ $test } }, "\n"; } } print "\n"; } I ended up using 2 hashes instead of 1 to include more information, but you're right that 1 would suffice in the initial design. The extra tab went away, no idea how. I also added a tab after COG_category, so that it looks like this now: Locus COG_category COG_category COG_ID Cluster_information COG_ID Cluster_information SeSA_B0001 [T] Signal transduction mechanisms [K] Transcription COG1974 SOS-response transcriptional repressors (RecA-mediated autopeptidases) SeSA_B0002 [L] Replication, recombination and repair COG0389 Nucleotidyltransferase/DNA polymerase involved in DNA repair Many thanks again! Must've spent ~ 4 days on this. I've been flirting with Perl less than a year, it's so seductive I find myself debating whether to go back to school. cheers galeb On Sat, Apr 23, 2011 at 3:30 PM, Mike McClain <mike.j...@cox.net> wrote: > On Sat, Apr 23, 2011 at 10:27:07AM -0400, galeb abu-ali wrote: > > Hi, > > > > I'm trying to parse a table containing information about genes in a > > bacterial chromosome. Below is a sample for one gene, and there's about > 4500 > > such blocks in a file: > <snip> > > My problem is that some genes have 2 entries for COG_category, some only > one > > and others none. I took a look at perldsc and tried to fit the table into > > one of the complex structures but didn't get far. Below is the code I > came > > up with so far: > <snip> > > #!/usr/bin/perl > # parse_IMG_gene_info.pl > use strict; use warnings; > > # '##' marks the lines of you code I changed/removed. > > # One of these days you're going to want this program > # to report on more than one file. > # So wrap 'for my $file ( @ARGV ) { ... }' around it > # And open( my $IN, '<', $file or die "Failed to open $file: $!\n"; > > open( IN, "<", @ARGV ) or die "Failed to open: $!\n"; > > # Put this print statement down by the other. Keeping functional groups > # together aids understanding and it'll all be there when you realize > # the task has grown to the point you want an output routine. > > print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n"; > > # In Perl it's customary to declare your variables just before you > # need them and makes it easier to notice that %locus & $e aren't used > my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, > $cluster_info, $e); > > # Keeping functional groups together suggests: > # open...; while(...); close; > # with nothing else intervening. Again $IN. > # And while( my $line = <$IN> ) protects you from the many things > # that change $_. > # If you don't get out of the habit of using $_ it 'will' bite you. > # BTW, chomp, split, /COG_category/ and several other functions > # act on $_ by default so 'split "\t", $_' and 'split "\t" > # are equivalent. note also that 'split /\t/' is preferable. > while( <IN> ) { > chomp; # remove linefeeds from $cluster_info > if( $_=~ /COG_category/ ) { > ## ( $oid, $locus, $source, $cluster_info ) = split "\t", $_; > > # the tabs got lost in the email and it was easier for me to > # change the split than change the data file. > ( $oid, $locus, $source, $cluster_info ) = split / +/, $_; > # When you found the 2nd cog_cat you overwrote the first > ## $cog_cat{ $locus } = $cluster_info; > push @{ $cog_cat{ $locus } }, $cluster_info if($cluster_info); > # You never used this and it's an array of hashes when what I think > # you need is an array in your hash > ## push( @cogs, { %cog_cat } ); > } elsif ( $_=~ /COG\d+/ ) { > #( $oid, $locus, $source, $cluster_info ) = split "\t", $_; > ( $oid, $locus, $source, $cluster_info ) = split / +/, $_; > $cog_id{ $locus } = $cluster_info; > } > } > close IN; > > #print scalar @cogs, "\n"; > > # Uncomment and take a look at the output of the next 2 lines > # and I think you'll see where you could use a single hash > # use Data::Dumper; > # print Dumper \%cog_cat, \%cog_id ; > > for my $test( sort keys %cog_cat ) { > ## print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n"; > print $test, map {"\t$_"} @{$cog_cat{ $test }}, "\t$cog_id{ $test }\n"; > } > print "\n"; > __END__ > This is the output I got running this against data from one of your > earlier posts. I think it's what you're looking for except for > an extra tab that I don't see where is coming from. > Challenge for the student? :) > > locus COG_category COG_category COGID Cluster_Information > > SeSA_B0001 [T] Signal transduction mechanisms [K] Transcription > SOS-response transcriptional repressors (RecA-mediated autopeptidases) > SeSA_B0002 [L] Replication, recombination and repair > Nucleotidyltransferase/DNA polymerase involved in DNA repair > > HTH, > Mike > -- > Satisfied user of Linux since 1997. > O< ascii ribbon campaign - stop html mail - www.asciiribbon.org > > -- > To unsubscribe, e-mail: beginners-unsubscr...@perl.org > For additional commands, e-mail: beginners-h...@perl.org > http://learn.perl.org/ > > >