"John W. Krahn" wrote: > > Pedro Antonio Reche wrote: > > > > Hi, I am interested in parsing the file at the bottom of this e-mail in > > order to extract the string between "" following /product=, > > /protein_id=, /db_xref= and /translation=, and that for each of the > > segment separated by the string "CDS". The ouptput for the example > > bellow should look like this: > > > > >V001|AAM13451.1|GI:20152990 > > MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL > > TKILNLFLMVSIKRSIFLTL > > >V002|AAA60951.1|GI:333518 > > KQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI > > IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP > > KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD > > SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG > > FKYVDGSASEDAADDTSLINSAKLIACV > > > > So far I have use the code below which actually work. However, I am not > > please with it, as it generates an empty element in the hash from the > > header of the file and becasue that there might be a better way to do > > this. Thereby, I will be very pleased for any input or alternative way > > to improve the code. > > Regards, > > pedro > > #!/usr/sbin/perl -w > > $/ = "\n CDS"; > > while(<>){ > > $_ =~ /product=\"(.+)\"/; > > $gname = $1; > > $gname =~ s/\s+//g; > > push @ID, $gname; > > $_ =~ /protein_id="([\w\.]+)\"/; > > $ref = $1; > > $_=~ /db_xref=\"GI:(\w+)\"/; > > $gid = $1; > > $_ =~ /translation=\"([A-Z\s]+)/; > > $seq = $1; > > $seq =~ s/\s+//g; > > $hash{$gname} = ["$ref", "$gid", "$seq"]; > > } > > open(F, ">test"); > > foreach $key (@ID){ > > print F ">gi|$hash{$key}[1]|$hash{$key}[0] > > $key\n$hash{$key}[2]\n"; > > } > > close(F); > > > > [snip] > > You probably want something like this: > > #!/usr/sbin/perl -w > use strict; > > $/ = "\n CDS"; Dear Krahn, the code you send is bassically what I have. Moreover, it gives the same problem, an empty record from the header of the file above the first CDS. Anyway, thanks a lot for the try. Best Pedro
>
> open F, '>test' or die "Cannot open 'test' $!";
>
> while ( <> ) {
>     my ($gname) = /product="([^"]+)"/;
>     $gname =~ s/\s+//g;
>     my ($ref) = /protein_id="([\w.]+)"/;
>     my ($gid) = /db_xref="(GI:\w+)"/;
>     my ($seq) = /translation="([A-Z\s]+)"/;
>     $seq =~ s/\s+//g;
>
>     print F "$gname|$ref|$gid\n$seq\n";
> }
>
> close F;
>
> John
> --
> use Perl;
> program
> fulfillment
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

--
*******************************************************************
PEDRO A. RECHE , pHD            TL: 617 632 3824
Dana-Farber Cancer Institute,   FX: 617 632 4569
Harvard Medical School,         EM: [EMAIL PROTECTED]
44 Binney Street, D1510A,       EM: [EMAIL PROTECTED]
Boston, MA 02115                URL: http://www.reche.org
*******************************************************************
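
The empty record discussed in this thread comes from the first chunk that the custom input record separator yields: everything above the first CDS line, which carries none of the four qualifiers. One possible way around it, offered only as a sketch, is to skip any chunk in which a qualifier fails to match. The helper name cds_to_fasta, the five-space indent in the separator, and the shortened sample data are all assumptions made for illustration; they do not come from the original file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# cds_to_fasta is a hypothetical helper name, not something from the thread.
# It reads CDS-delimited chunks from a filehandle and returns FASTA entries.
sub cds_to_fasta {
    my ($fh) = @_;

    # Assumption: CDS feature lines are indented by five spaces, as in
    # GenBank flat files. Adjust the separator to match the real input.
    local $/ = "\n     CDS";

    my @entries;
    while ( my $record = <$fh> ) {
        # Skip any chunk that lacks one of the four qualifiers. This drops
        # the file header that precedes the first CDS.
        my ($gname) = $record =~ /product="([^"]+)"/        or next;
        my ($ref)   = $record =~ /protein_id="([\w.]+)"/    or next;
        my ($gid)   = $record =~ /db_xref="(GI:\w+)"/       or next;
        my ($seq)   = $record =~ /translation="([A-Z\s]+)"/ or next;

        # Qualifier values may wrap across lines, so strip all whitespace.
        s/\s+//g for $gname, $seq;

        push @entries, ">$gname|$ref|$gid\n$seq\n";
    }
    return @entries;
}

# Illustrative input only: made-up layout with shortened sequences.
my $sample = <<'END';
LOCUS       EXAMPLE
FEATURES             Location/Qualifiers
     source          1..300
     CDS             1..90
                     /product="V001"
                     /protein_id="AAM13451.1"
                     /db_xref="GI:20152990"
                     /translation="MESLKYFYSLSLSLFNGL
                     TKILNLFLMVSIKRSIFLTL"
     CDS             100..300
                     /product="V002"
                     /protein_id="AAA60951.1"
                     /db_xref="GI:333518"
                     /translation="KQIVLACICLAAVAIPTSLQ"
END

# Read the sample through an in-memory filehandle and print the entries.
open my $fh, '<', \$sample or die "Cannot read sample: $!";
print cds_to_fasta($fh);
close $fh;
```

To run this against a real file, the in-memory handle could be replaced by one opened on the file itself. Testing each match rather than relying on the record number also skips a CDS that happens to lack a qualifier, which may or may not be the desired behaviour.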