parse complicated table

galeb abu-ali Sat, 23 Apr 2011 06:26:23 -0700

Hi,

I'm trying to parse a table containing information about genes in a
bacterial chromosome. Below is a sample for several genes; there are about
4500 such blocks in the file:


gene_oid    Locus Tag    Source    Cluster Information    Gene Information    E-value
642745051    SeSA_B0001    COG_category    [T] Signal transduction mechanisms
642745051    SeSA_B0001    COG_category    [K] Transcription
642745051    SeSA_B0001    COG1974    SOS-response transcriptional repressors (RecA-mediated autopeptidases)        2.0e-29
642745051    SeSA_B0001    pfam00717    Peptidase_S24        1.7e-13
642745051    SeSA_B0001    EC:3.4.21.-    Hydrolases. Acting on peptide bonds (peptide hydrolases). Serine endopeptidases.
642745051    SeSA_B0001    KO:K03503    DNA polymerase V [EC:3.4.21.-]        0.0e+00
642745051    SeSA_B0001    ITERM:03797    SOS response UmuD protein. Serine peptidase. MEROPS family S24
642745051    SeSA_B0001    Locus_type        CDS
642745051    SeSA_B0001    NCBI_accession        YP_002112883
642745051    SeSA_B0001    Product_name        protein SamA
642745051    SeSA_B0001    Scaffold        NC_011092
642745051    SeSA_B0001    Coordinates        34..459(+)
642745051    SeSA_B0001    DNA_length        426bp
642745051    SeSA_B0001    Protein_length        141aa
642745051    SeSA_B0001    GC        .52

642745052    SeSA_B0002    COG_category    [L] Replication, recombination and repair
642745052    SeSA_B0002    COG0389    Nucleotidyltransferase/DNA polymerase involved in DNA repair        4.0e-71
642745052    SeSA_B0002    pfam00817    IMS        2.7e-36
642745052    SeSA_B0002    pfam11798    IMS_HHH        6.8e-06
642745052    SeSA_B0002    pfam11799    IMS_C        4.0e-11
642745052    SeSA_B0002    KO:K03502    DNA polymerase V        0.0e+00
642745052    SeSA_B0002    Locus_type        CDS
642745052    SeSA_B0002    NCBI_accession        YP_002112884
642745052    SeSA_B0002    Product_name        protein UmuC
642745052    SeSA_B0002    Scaffold        NC_011092
642745052    SeSA_B0002    Coordinates        459..1730(+)
642745052    SeSA_B0002    DNA_length        1272bp
642745052    SeSA_B0002    Protein_length        423aa
642745052    SeSA_B0002    GC        .57
642745052    SeSA_B0002    Fused_gene        Yes

642745053    SeSA_B0003    pfam02604    PhdYeFM        6.0e-07
642745053    SeSA_B0003    Locus_type        CDS
642745053    SeSA_B0003    NCBI_accession        YP_002112885
642745053    SeSA_B0003    Product_name        antitoxin of toxin-antitoxin stability system, StbD family
642745053    SeSA_B0003    Scaffold        NC_011092
642745053    SeSA_B0003    Coordinates        1809..2060(+)
642745053    SeSA_B0003    DNA_length        252bp
642745053    SeSA_B0003    Protein_length        83aa
642745053    SeSA_B0003    GC        .51

I want to extract the Locus Tag, Source, and Cluster Information fields for
each gene so that the output table looks like this:


locus    COG_category    COG_category    COGID    Cluster_Information

SeSA_B0001    [K] Transcription    COG1974    SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002    [L] Replication, recombination and repair    COG0389    Nucleotidyltransferase/DNA polymerase involved in DNA repair


My problem is that some genes have 2 entries for COG_category, some only one
and others none. I took a look at perldsc and tried to fit the table into
one of the complex structures but didn't get far. Below is the code I came
up with so far:

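#!/usr/bin/perl
# parse_IMG_gene_info.pl
use strict;
use warnings;

# open() expects a single file name, so take the first argument
# instead of passing the whole @ARGV list.
my $file = shift @ARGV or die "Usage: $0 gene_info_file\n";
open( my $in, "<", $file ) or die "Failed to open $file: $!\n";

print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";

my( %cog_cat, %cog_id );

while( my $line = <$in> ) {
    chomp $line;    # otherwise the last field keeps its trailing newline
    my( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;
    next unless defined $source;    # skip blank lines

    if( $source eq 'COG_category' ) {
        # Limitation: a second COG_category for the same locus
        # overwrites the first one here.
        $cog_cat{ $locus } = $cluster_info;
    } elsif( $source =~ /^COG\d+$/ ) {
        $cog_id{ $locus } = $cluster_info;
    }
}

close $in;

for my $test( sort keys %cog_cat ) {
    # A gene may have a COG_category but no COG id, so avoid printing undef
    my $id = defined $cog_id{ $test } ? $cog_id{ $test } : '';
    print "$test\t$cog_cat{ $test }\t$id\n";
}
print "\n";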
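
One possible way to cope with zero, one, or two COG_category entries per gene is a hash of arrays, one of the structures perldsc covers. The following is only a minimal sketch, not a tested solution for the full file. It assumes the fields are tab-separated in the order shown in the sample, that at most two categories should be printed, and that genes without any COG data still get a row with empty columns. The names parse_gene_table and format_row are hypothetical helpers made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read tab-separated lines from a filehandle and collect,
# per locus, every COG_category plus the COG id and its description.
sub parse_gene_table {
    my ($fh) = @_;
    my ( %gene, @order );
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $oid, $locus, $source, $info ) = split /\t/, $line;
        next unless defined $source;      # blank or malformed line
        next if $oid eq 'gene_oid';       # header line
        $info = '' unless defined $info;

        if ( !$gene{$locus} ) {
            $gene{$locus} = { cats => [], cog_id => '', cog_desc => '' };
            push @order, $locus;          # remember first-seen order
        }
        if ( $source eq 'COG_category' ) {
            push @{ $gene{$locus}{cats} }, $info;    # keeps every category
        }
        elsif ( $source =~ /^COG\d+$/ ) {
            $gene{$locus}{cog_id}   = $source;
            $gene{$locus}{cog_desc} = $info;
        }
    }
    return ( \%gene, \@order );
}

# Hypothetical helper: build one output row with exactly two category columns.
sub format_row {
    my ( $locus, $rec ) = @_;
    my @cats = @{ $rec->{cats} };
    push @cats, '' while @cats < 2;       # pad when fewer than two categories
    return join "\t", $locus, @cats[ 0, 1 ], $rec->{cog_id}, $rec->{cog_desc};
}

# Demo on a few lines taken from the sample, read from an in-memory filehandle.
my $sample = join '', map { "$_\n" } (
    "642745051\tSeSA_B0001\tCOG_category\t[T] Signal transduction mechanisms",
    "642745051\tSeSA_B0001\tCOG_category\t[K] Transcription",
    "642745051\tSeSA_B0001\tCOG1974\tSOS-response transcriptional repressors (RecA-mediated autopeptidases)\t\t2.0e-29",
    "",
    "642745053\tSeSA_B0003\tLocus_type\t\tCDS",
);
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!\n";
my ( $gene, $order ) = parse_gene_table($fh);
close $fh;

print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n";
print format_row( $_, $gene->{$_} ), "\n" for @$order;
```

To run this against the real file, the in-memory filehandle could be replaced by one opened on $ARGV[0]. If genes without COG data should be left out of the table, a test on the cats array and cog_id before printing would do it.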



Your insight is greatly appreciated!

galeb
