On Sat, Apr 23, 2011 at 10:27:07AM -0400, galeb abu-ali wrote: > Hi, > > I'm trying to parse a table containing information about genes in a > bacterial chromosome. Below is a sample for one gene, and there's about 4500 > such blocks in a file: <snip> > My problem is that some genes have 2 entries for COG_category, some only one > and others none. I took a look at perldsc and tried to fit the table into > one of the complex structures but didn't get far. Below is the code I came > up with so far: <snip>
#!/usr/bin/perl # parse_IMG_gene_info.pl use strict; use warnings; # '##' marks the lines of you code I changed/removed. # One of these days you're going to want this program # to report on more than one file. # So wrap 'for my $file ( @ARGV ) { ... }' around it # And open( my $IN, '<', $file or die "Failed to open $file: $!\n"; open( IN, "<", @ARGV ) or die "Failed to open: $!\n"; # Put this print statement down by the other. Keeping functional groups # together aids understanding and it'll all be there when you realize # the task has grown to the point you want an output routine. print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n"; # In Perl it's customary to declare your variables just before you # need them and makes it easier to notice that %locus & $e aren't used my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, $cluster_info, $e); # Keeping functional groups together suggests: # open...; while(...); close; # with nothing else intervening. Again $IN. # And while( my $line = <$IN> ) protects you from the many things # that change $_. # If you don't get out of the habit of using $_ it 'will' bite you. # BTW, chomp, split, /COG_category/ and several other functions # act on $_ by default so 'split "\t", $_' and 'split "\t" # are equivalent. note also that 'split /\t/' is preferable. while( <IN> ) { chomp; # remove linefeeds from $cluster_info if( $_=~ /COG_category/ ) { ## ( $oid, $locus, $source, $cluster_info ) = split "\t", $_; # the tabs got lost in the email and it was easier for me to # change the split than change the data file. ( $oid, $locus, $source, $cluster_info ) = split / +/, $_; # When you found the 2nd cog_cat you overwrote the first ## $cog_cat{ $locus } = $cluster_info; push @{ $cog_cat{ $locus } }, $cluster_info if($cluster_info); # You never used this and it's an array of hashes when what I think # you need is an array in your hash ## push( @cogs, { %cog_cat } ); } elsif ( $_=~ /COG\d+/ ) { #( $oid, $locus, $source, $cluster_info ) = split "\t", $_; ( $oid, $locus, $source, $cluster_info ) = split / +/, $_; $cog_id{ $locus } = $cluster_info; } } close IN; #print scalar @cogs, "\n"; # Uncomment and take a look at the output of the next 2 lines # and I think you'll see where you could use a single hash # use Data::Dumper; # print Dumper \%cog_cat, \%cog_id ; for my $test( sort keys %cog_cat ) { ## print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n"; print $test, map {"\t$_"} @{$cog_cat{ $test }}, "\t$cog_id{ $test }\n"; } print "\n"; __END__ This is the output I got running this against data from one of your earlier posts. I think it's what you're looking for except for an extra tab that I don't see where is coming from. Challenge for the student? :) locus COG_category COG_category COGID Cluster_Information SeSA_B0001 [T] Signal transduction mechanisms [K] Transcription SOS-response transcriptional repressors (RecA-mediated autopeptidases) SeSA_B0002 [L] Replication, recombination and repair Nucleotidyltransferase/DNA polymerase involved in DNA repair HTH, Mike -- Satisfied user of Linux since 1997. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/