"John W. Krahn" wrote:
> 
> Pedro Antonio Reche wrote:
> >
> > Hi, I am interested in parsing the file at the bottom of this e-mail in
> > order to extract the strings between "" following /product=,
> > /protein_id=, /db_xref= and /translation=, for each of the
> > segments separated by the string "CDS". The output for the example
> > below should look like this:
> >
> > >V001|AAM13451.1|GI:20152990
> > MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
> > TKILNLFLMVSIKRSIFLTL
> > >V002|AAA60951.1|GI:333518
> > KQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI
> > IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP
> > KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD
> > SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG
> > FKYVDGSASEDAADDTSLINSAKLIACV
> >
> > So far I have used the code below, which actually works. However, I am not
> > pleased with it, as it generates an empty element in the hash from the
> > header of the file, and because there might be a better way to do
> > this. I would therefore be very grateful for any input or alternative way
> > to improve the code.
> > Regards,
> > pedro
> > #!/usr/sbin/perl -w
> > $/ = "\n     CDS";
> > while(<>){
> >         $_ =~ /product=\"(.+)\"/;
> >                 $gname = $1;
> >                 $gname =~ s/\s+//g;
> >                 push @ID, $gname;
> >         $_ =~ /protein_id="([\w\.]+)\"/;
> >                 $ref = $1;
> >         $_=~ /db_xref=\"GI:(\w+)\"/;
> >                 $gid = $1;
> >         $_ =~ /translation=\"([A-Z\s]+)/;
> >                 $seq = $1;
> >                 $seq  =~ s/\s+//g;
> >                $hash{$gname} = ["$ref", "$gid", "$seq"];
> > }
> > open(F, ">test");
> > foreach $key (@ID){
> >         print F ">gi|$hash{$key}[1]|$hash{$key}[0]
> > $key\n$hash{$key}[2]\n";
> > }
> > close(F);
> >
> > [snip]
> 
> You probably want something like this:
> 
> #!/usr/sbin/perl -w
> use strict;
> 
> $/ = "\n     CDS";
Dear Krahn, the code you sent is basically what I have. Moreover, it
has the same problem: an empty record from the header of the file
above the first CDS. Anyway, thanks a lot for trying.
Best,
Pedro
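
One possible way around the empty header record, as a minimal sketch rather than a definitive fix: keep the same record separator, but skip any chunk that has no /translation= qualifier, which covers the header above the first CDS. The sub name cds_to_fasta and the sample record are made up for illustration, and the sketch assumes every real CDS carries a /translation= line terminated by a closing quote.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read CDS-delimited records from a filehandle and
# return FASTA-style text. Chunks without a /translation= qualifier
# (such as the file header before the first CDS) are skipped.
sub cds_to_fasta {
    my ($fh) = @_;
    local $/ = "\n     CDS";    # same record separator as in the thread
    my $out = '';
    while ( my $rec = <$fh> ) {
        # No translation in this chunk: not a usable CDS record, skip it.
        my ($seq) = $rec =~ /translation="([A-Z\s]+)"/ or next;
        my ($gname) = $rec =~ /product="([^"]+)"/;
        my ($ref)   = $rec =~ /protein_id="([\w.]+)"/;
        my ($gid)   = $rec =~ /db_xref="(GI:\w+)"/;
        # Strip line breaks and padding from the multi-line values.
        for ( $seq, $gname ) { s/\s+//g if defined $_ }
        # Missing qualifiers become empty fields instead of undef warnings.
        my $header = join '|', map { defined $_ ? $_ : '' } $gname, $ref, $gid;
        $out .= ">$header\n$seq\n";
    }
    return $out;
}

# Made-up sample input: a two-line header followed by a single CDS.
my $sample = join "\n",
    'LOCUS       EXAMPLE',
    'FEATURES             Location/Qualifiers',
    '     CDS             1..60',
    '                     /product="V001"',
    '                     /protein_id="AAM13451.1"',
    '                     /db_xref="GI:20152990"',
    '                     /translation="MESLKYFYSLSLSLFNGLTKILNLFL',
    '                     MVSIKRSIFLTL"',
    '';
open my $in, '<', \$sample or die "Cannot read sample: $!";
print cds_to_fasta($in);
```

For a real file, open it and pass that filehandle instead of the in-memory sample. Skipping the first record with "next if $. == 1;" would also drop the header, but testing for /translation= additionally guards against CDS entries that have no protein sequence.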

> 
> open F, '>test' or die "Cannot open 'test' $!";
> 
> while ( <> ) {
>     my ($gname) = /product="([^"]+)"/;
>     $gname      =~ s/\s+//g;
>     my ($ref)   = /protein_id="([\w.]+)"/;
>     my ($gid)   = /db_xref="(GI:\w+)"/;
>     my ($seq)   = /translation="([A-Z\s]+)"/;
>     $seq        =~ s/\s+//g;
> 
>     print F ">$gname|$ref|$gid\n$seq\n";
> }
> 
> close F;
> 
> John
> --
> use Perl;
> program
> fulfillment
> 
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

-- 
*******************************************************************
PEDRO A. RECHE , PhD            TL: 617 632 3824
Dana-Farber Cancer Institute,   FX: 617 632 4569
Harvard Medical School,         EM: [EMAIL PROTECTED]
44 Binney Street, D1510A,       EM: [EMAIL PROTECTED]              
Boston, MA 02115                URL: http://www.reche.org                              
                 
*******************************************************************
