Hi guys, Thanks a lot for this issue and with Anjan's help here I got the wonderful script for my work and I post below, hopefully it would be helpful for others--works perfectly for fasta file. Also thanks to others who are interested in the discussion, but I think it takes time for me to digest those "weird" symbols-----Tack Så Mycket!!! Have a nice day Changrong
usage: perl seqComp.pl <input_fasta_file> > <output_file> #! /usr/bin/perl -w use strict; open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!"; my %s;# a hash of arrays, to hold each line of sequence my %seq; #a hash to hold the AA sequences. my $key; while (<S>){ #Read the FASTA file. chomp; if (/>/){ s/>//; $key= $_; }else{ push (@{$s{$key}}, $_); } } foreach my $a (keys %s){ my $s= join("", @{$s{$a}}); $seq{$a}=$s; #print("$a\t$s\n"); } my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V); my $aa= join("\t", @aa); print ("Sequence\t$aa\n"); foreach my $k (keys %seq){ my %count; # a hash to hold the count for each amino acid in the protein. my @seq= split(//, $seq{$k}); foreach my $r(@seq){ $count{$r}++; } my @row; push(@row, $k); foreach my $a (@aa){ $count{$a}||=0; $count{$a}= sprintf("%0.2f", $count{$a}/length($seq{$k})); push(@row,$count{$a}); } my $row= join("\t",@row); print("$row\n"); } On Wed, Dec 1, 2010 at 8:56 PM, ANJAN PURKAYASTHA < anjan.purkayas...@gmail.com> wrote: > here is the script. give it a name, say, seqComp.pl. usage: perl seqComp.pl > <input_FASTA_file>. > HTH, > Anjan > > #! /usr/bin/perl -w > use strict; > > open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!"; > > my %s;# a hash of arrays, to hold each line of sequence > my %seq; #a hash to hold the AA sequences. > my $key; > > while (<S>){ #Read the FASTA file. > chomp; > > if (/>/){ > s/>//; > $key= $_; > }else{ > push (@{$s{$key}}, $_); > } > > } > > foreach my $a (keys %s){ > my $s= join("", @{$s{$a}}); > $seq{$a}=$s; > #print("$a\t$s\n"); > } > > my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V); > > foreach my $k (keys %seq){ > my %count; # a hash to hold the count for each amino acid in the > protein. > my @seq= split(//, $seq{$k}); > foreach my $r(@seq){ > $count{$r}++; > } > print("$k\n"); > foreach my $a (@aa){ > $count{$a}||=0; > $count{$a}= sprintf("%0.2f", $count{$a}/length($seq{$k})); > print("$a\t$count{$a}\n"); > } > > } > > On Wed, Dec 1, 2010 at 2:31 PM, Rob Dixon <rob.di...@gmx.com> wrote: > >> On 01/12/2010 08:44, Changrong Ge wrote: >> >>> >>> I am quite new to this perl language-I am from biochemistry field. >>> Now trying to write a script for my current work but could not make >>> it. The idea is to calculate the composition (percentage) of amino >>> acids in a protein sequence. >>> >>> Input is a series of fasta format (protein sequence) output is a tab >>> delimited format like below: > >>> >>> Name A T C D N Q E ....... >>> protein1 0.23 0.40 0.20 ... >>> protein2 0.52 0.01 .... >>> protein3 >>> ...... >>> Could somebody help me with this? I tried reading some books like >>> perl for bioinformatics, but still not into it. >>> >> >> Hi Changrong >> >> The problem doesn't seem difficult, but I'm afraid we don't have much >> knowledge of bioinformatics between us. If you post a sample of input >> data and the corresponding output you desire then I am sure we can help. >> >> Regards, >> >> Rob >> >> >> -- >> To unsubscribe, e-mail: beginners-unsubscr...@perl.org >> For additional commands, e-mail: beginners-h...@perl.org >> http://learn.perl.org/ >> >> >> > > > -- > =================================== > anjan purkayastha, phd. > research associate > fas center for systems biology, > harvard university > 52 oxford street > cambridge ma 02138 > phone-703.740.6939 > =================================== > -- Changrong Ge, PhD student, Center for Biomembrane Research, Dept.of Biochemistry & Biophysics, Svante Arrheniusväg 16, Stockholm University, SE-10691 Stockholm, Sweden. Tel: +46-(0)8-16 24 87, Mobile: +46-(0)76-2176 119 Fax: +46-(0)8-16 2487, Email: changr...@dbb.su.se or changrong...@gmail.com