Mike,

many thanks for the help, particularly your comments which were as valuable
as your code!

I revised the code to the following:

#!/usr/bin/perl
# parse_IMG_gene_info4.pl
use strict; use warnings;


for my $file( @ARGV ) {
    open( my $IN, "<", $file ) or die "Failed to open: $!\n";

    my( %cog_cat, %cog_id, @cogs, $oid, $locus, $source, $cluster_info );

    while( my $line = <$IN> ) {
        chomp $line;

        if( $line=~ /COG_category/ ) {
            ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;

            push @{ $cog_cat{ $locus } }, $cluster_info if( $cluster_info );

        } elsif ( $line=~ /COG\d+/ ) {
            ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;
            push @{ $cog_id{ $locus } }, $source if( $source );
            push @{ $cog_id{ $locus } }, $cluster_info if( $cluster_info );
        }
    }

    close $IN;
   #use Data::Dumper;

   #print Dumper \%cog_cat, \%cog_id;

    print
"Locus\tCOG_category\tCOG_category\tCOG_ID\tCluster_information\tCOG_ID\tCluster_information\n\n";

    for my $test( sort keys %cog_cat ) {
        print $test, map {"\t$_"} @{ $cog_cat{ $test } }; #\t$cog_id{ $test
}

        if( scalar @{ $cog_cat{ $test } } > 1 ) {
            print map {"\t$_"} @{ $cog_id{ $test } }, "\n";
        } else {
            print "\t", map {"\t$_"} @{ $cog_id{ $test } }, "\n";
        }

    }
    print "\n";
}

I ended up using 2 hashes instead of 1 to include more information, but
you're right that 1 would suffice in the initial design. The extra tab went
away, no idea how. I also added a tab after COG_category, so that it looks
like this now:


  Locus COG_category COG_category COG_ID Cluster_information COG_ID
Cluster_information






 SeSA_B0001 [T] Signal transduction mechanisms [K] Transcription
COG1974 SOS-response
transcriptional repressors (RecA-mediated autopeptidases)

 SeSA_B0002 [L] Replication, recombination and repair
COG0389 Nucleotidyltransferase/DNA polymerase involved in DNA repair





Many thanks again! Must've spent ~ 4 days on this. I've been flirting with
Perl less than a year, it's so seductive I find myself debating whether to
go back to school.

cheers

galeb





On Sat, Apr 23, 2011 at 3:30 PM, Mike McClain <mike.j...@cox.net> wrote:

> On Sat, Apr 23, 2011 at 10:27:07AM -0400, galeb abu-ali wrote:
> > Hi,
> >
> > I'm trying to parse a table containing information about genes in a
> > bacterial chromosome. Below is a sample for one gene, and there's about
> 4500
> > such blocks in a file:
> <snip>
> > My problem is that some genes have 2 entries for COG_category, some only
> one
> > and others none. I took a look at perldsc and tried to fit the table into
> > one of the complex structures but didn't get far. Below is the code I
> came
> > up with so far:
> <snip>
>
> #!/usr/bin/perl
> # parse_IMG_gene_info.pl
> use strict; use warnings;
>
> #   '##' marks the lines of you code I changed/removed.
>
> #   One of these days you're going to want this program
> #       to report on more than one file.
> #   So wrap 'for my $file ( @ARGV ) { ... }' around it
> #   And open( my $IN, '<', $file   or die "Failed to open $file: $!\n";
>
> open( IN, "<", @ARGV ) or die "Failed to open: $!\n";
>
> #   Put this print statement down by the other. Keeping functional groups
> #   together aids understanding and it'll all be there when you realize
> #   the task has grown to the point you want an output routine.
>
> print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";
>
> #   In Perl it's customary to declare your variables just before you
> #   need them and makes it easier to notice that %locus & $e aren't used
> my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source,
>        $cluster_info, $e);
>
> #   Keeping functional groups together suggests:
> #       open...; while(...); close;
> #   with nothing else intervening.  Again $IN.
> #   And while( my $line = <$IN> ) protects you from the many things
> #       that change $_.
> #   If you don't get out of the habit of using $_ it 'will' bite you.
> #   BTW, chomp, split, /COG_category/ and several other functions
> #       act on $_ by default so 'split "\t", $_' and 'split "\t"
> #       are equivalent.     note also that 'split /\t/' is preferable.
> while( <IN> ) {
>    chomp;                          #   remove linefeeds from $cluster_info
>     if( $_=~ /COG_category/ ) {
> ##        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>
> #   the tabs got lost in the email and it was easier for me to
> #       change the split than change the data file.
>        ( $oid, $locus, $source, $cluster_info ) = split /  +/, $_;
> #   When you found the 2nd cog_cat you overwrote the first
> ##        $cog_cat{ $locus } =  $cluster_info;
>        push @{ $cog_cat{ $locus } }, $cluster_info     if($cluster_info);
> #   You never used this and it's an array of hashes when what I think
> #       you need is an array in your hash
> ##        push( @cogs, { %cog_cat } );
>     } elsif ( $_=~ /COG\d+/ ) {
>        #( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>         ( $oid, $locus, $source, $cluster_info ) = split /  +/, $_;
>         $cog_id{ $locus } =  $cluster_info;
>    }
> }
> close IN;
>
> #print scalar @cogs, "\n";
>
> #   Uncomment and take a look at the output of the next 2 lines
> #   and I think you'll see where you could use a single hash
> # use Data::Dumper;
> # print Dumper \%cog_cat, \%cog_id ;
>
> for my $test( sort keys %cog_cat ) {
> ##    print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
>     print $test, map {"\t$_"} @{$cog_cat{ $test }}, "\t$cog_id{ $test }\n";
> }
> print "\n";
> __END__
> This is the output I got running this against data from one of your
> earlier posts. I think it's what you're looking for except for
> an extra tab that I don't see where is coming from.
> Challenge for the student? :)
>
> locus   COG_category    COG_category    COGID   Cluster_Information
>
> SeSA_B0001      [T] Signal transduction mechanisms      [K] Transcription
>    SOS-response transcriptional repressors (RecA-mediated autopeptidases)
> SeSA_B0002      [L] Replication, recombination and repair
> Nucleotidyltransferase/DNA polymerase involved in DNA repair
>
> HTH,
> Mike
> --
> Satisfied user of Linux since 1997.
> O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
>
>
>

Reply via email to