Hi, I'm trying to parse a table containing information about genes in a bacterial chromosome. Below is a sample for several genes; there are about 4500 such blocks in the file:

gene_oid	Locus Tag	Source	Cluster Information	Gene Information	E-value
642745051	SeSA_B0001	COG_category	[T] Signal transduction mechanisms
642745051	SeSA_B0001	COG_category	[K] Transcription
642745051	SeSA_B0001	COG1974	SOS-response transcriptional repressors (RecA-mediated autopeptidases)	2.0e-29
642745051	SeSA_B0001	pfam00717	Peptidase_S24	1.7e-13
642745051	SeSA_B0001	EC:3.4.21.-	Hydrolases. Acting on peptide bonds (peptide hydrolases). Serine endopeptidases.
642745051	SeSA_B0001	KO:K03503	DNA polymerase V [EC:3.4.21.-]	0.0e+00
642745051	SeSA_B0001	ITERM:03797	SOS response UmuD protein. Serine peptidase. MEROPS family S24
642745051	SeSA_B0001	Locus_type	CDS
642745051	SeSA_B0001	NCBI_accession	YP_002112883
642745051	SeSA_B0001	Product_name	protein SamA
642745051	SeSA_B0001	Scaffold	NC_011092
642745051	SeSA_B0001	Coordinates	34..459(+)
642745051	SeSA_B0001	DNA_length	426bp
642745051	SeSA_B0001	Protein_length	141aa
642745051	SeSA_B0001	GC	.52
642745052	SeSA_B0002	COG_category	[L] Replication, recombination and repair
642745052	SeSA_B0002	COG0389	Nucleotidyltransferase/DNA polymerase involved in DNA repair	4.0e-71
642745052	SeSA_B0002	pfam00817	IMS	2.7e-36
642745052	SeSA_B0002	pfam11798	IMS_HHH	6.8e-06
642745052	SeSA_B0002	pfam11799	IMS_C	4.0e-11
642745052	SeSA_B0002	KO:K03502	DNA polymerase V	0.0e+00
642745052	SeSA_B0002	Locus_type	CDS
642745052	SeSA_B0002	NCBI_accession	YP_002112884
642745052	SeSA_B0002	Product_name	protein UmuC
642745052	SeSA_B0002	Scaffold	NC_011092
642745052	SeSA_B0002	Coordinates	459..1730(+)
642745052	SeSA_B0002	DNA_length	1272bp
642745052	SeSA_B0002	Protein_length	423aa
642745052	SeSA_B0002	GC	.57
642745052	SeSA_B0002	Fused_gene	Yes
642745053	SeSA_B0003	pfam02604	PhdYeFM	6.0e-07
642745053	SeSA_B0003	Locus_type	CDS
642745053	SeSA_B0003	NCBI_accession	YP_002112885
642745053	SeSA_B0003	Product_name	antitoxin of toxin-antitoxin stability system, StbD family
642745053	SeSA_B0003	Scaffold	NC_011092
642745053	SeSA_B0003	Coordinates	1809..2060(+)
642745053	SeSA_B0003	DNA_length	252bp
642745053	SeSA_B0003	Protein_length	83aa
642745053	SeSA_B0003	GC	.51

I want to extract the Locus Tag, Source, and Cluster Information fields for each gene so that the output table looks like this:

locus	COG_category	COG_category	COGID	Cluster_Information
SeSA_B0001	[K] Transcription	COG1974	SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002	[L] Replication, recombination and repair	COG0389	Nucleotidyltransferase/DNA polymerase involved in DNA repair

My problem is that some genes have two entries for COG_category, some have only one, and others have none. I took a look at perldsc and tried to fit the table into one of the complex data structures, but didn't get far. Below is the code I have come up with so far:

#!/usr/bin/perl
# parse_IMG_gene_info.pl
use strict;
use warnings;

open( IN, "<", @ARGV ) or die "Failed to open: $!\n";

print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";

my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, $cluster_info, $e );

while( <IN> ) {
    if( $_=~ /COG_category/ ) {
        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
        $cog_cat{ $locus } = $cluster_info;
        push( @cogs, { %cog_cat } );
    }
    elsif ( $_=~ /COG\d+/ ) {
        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
        $cog_id{ $locus } = $cluster_info;
    }
}
close IN;

#print scalar @cogs, "\n";
for my $test( sort keys %cog_cat ) {
    print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
}
print "\n";

Your insight is greatly appreciated!

galeb
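
One way to handle the varying number of COG_category rows is a hash of hashes keyed by locus tag, where each gene holds an array of categories, so a second category is appended instead of overwriting the first (which is what assigning to $cog_cat{ $locus } does in the script above). The following is only a sketch, not a tested solution for the full file. It assumes the input is tab-separated, that a gene has at most one COG ID row and at most two COG_category rows, and that genes with no COG rows can be left out of the output. The names collect_cog_info, format_rows, categories, cog_id, and cog_info are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read tab-separated rows from a filehandle and group the COG rows by locus tag.
# Returns (\@order, \%genes); @order keeps loci in order of first appearance.
sub collect_cog_info {
    my ($fh) = @_;
    my ( %genes, @order );
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $oid, $locus, $source, $info ) = split /\t/, $line;
        next unless defined $source && defined $info;
        # Keep only COG_category rows and COG ID rows (e.g. COG1974).
        next unless $source eq 'COG_category' || $source =~ /^COG\d+$/;

        push @order, $locus unless exists $genes{$locus};
        my $gene = $genes{$locus}
            ||= { categories => [], cog_id => '', cog_info => '' };

        if ( $source eq 'COG_category' ) {
            # Append, so genes with two categories keep both.
            push @{ $gene->{categories} }, $info;
        }
        else {
            # Assumption: one COG ID per gene; a later row would overwrite this.
            @{$gene}{qw(cog_id cog_info)} = ( $source, $info );
        }
    }
    return ( \@order, \%genes );
}

# Build one tab-separated output row per gene, always with two category columns.
sub format_rows {
    my ( $order, $genes ) = @_;
    my @rows;
    for my $locus (@$order) {
        my $gene = $genes->{$locus};
        my @cats = @{ $gene->{categories} };
        push @cats, '' while @cats < 2;    # pad missing categories with blanks
        push @rows, join "\t", $locus, @cats[ 0, 1 ],
            $gene->{cog_id}, $gene->{cog_info};
    }
    return @rows;
}

# Runs only when a file name is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $in, '<', $file or die "Failed to open $file: $!\n";
    my ( $order, $genes ) = collect_cog_info($in);
    close $in;
    print join( "\t",
        qw(locus COG_category COG_category COGID Cluster_Information) ), "\n";
    print "$_\n" for format_rows( $order, $genes );
}
```

Matching on the Source field after the split, rather than on the whole line, keeps a description that happens to mention a COG number from being picked up by mistake. If genes without any COG rows should also appear (with blank columns), the `next unless $source eq ...` filter could be moved below the point where the locus is registered.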