Hello again,

Thanks to everyone for their helpful suggestions.  I finally got it to
work, using the following script.  However, it takes about 5 hours to
run on a fast computer.  Using grep (in bash), on the other hand, takes
about 5 minutes (see below if you are interested).  Thanks again!

SLOW perl script:

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID_all_X';

unless (open(IDFILE, $IDs)) {
        print "Could not open file $IDs!\n";
        }

my $probes = 'HG_U95Av2_probe_fasta';

unless (open(PROBES, $probes)) {
        print "Could not open file $probes!\n";
        }

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
print @ID;
chomp @ID;

while (my $line = <PROBES>) {
        foreach my $identifier (@ID) {
                if($line=~/^>probe:\w+:$identifier:/) {
                                print OUT $line;
                                print OUT scalar(<PROBES>);
                }
        }
}
exit;


FAST bash script:

#!/usr/bin/bash
exec<"ID_all_X"
while read line; do
        echo $line
        grep -A 1 :$line: HG_U95Av2_probe_fasta >>myresults.txt
done


-----Original Message-----
From: Muma W. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 14, 2006 12:20 AM
To: Beginners List
Subject: Re: A loop to parse a large text file--output is empty!


Michael Oldham wrote:
> Dear all,
>
> I am a Perl newbie struggling to accomplish a conceptually simple
> bioinformatics task.  I have a single large file containing about
> 200,000 DNA probe sequences (from an Affymetrix microarray), each of
> which is accompanied by a header, like so (this is in FASTA format):
>
>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
> Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
>> probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258;
Antisense;
> CTACTCTCGTGGTGCACAAGGAGTG
>> probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
> Antisense;
> TGCAGGTGGCAGATCTGCAGTCCAT
>> probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
> Antisense;
> GTGAAGGTTGCTGAGGCTCTGACCC
>
> .........etc.
>
> What I would like to do is extract from this file a subset of ~130,800
> probes (both the header and the sequence) and output this subset into
a
> new text file in the same (FASTA) format.  These 130,800 probes
> correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
> set ID in the header listed above).  I have these 8,175 IDs listed in
a
> separate file called "ID.txt" and the 200,000 probe sequences in a
file
> called "HG_U95Av2_probe_fasta.txt".  The script below is missing
> something because the output file ("probe_subset.txt") is blank.  This
> is also the case if I replace the file "ID.txt" with a file consisting
> of a single probe set ID (e.g. 1138_at).  Does anyone know what I am
> missing?  I am running this script in Cygwin on Windows XP.  I
> appreciate any suggestions!
>
> ~ Mike O.
> [...]

You'd have to modify this to work on your data, but it's close to what
you're looking for I think. For my convenience, I considered only the
numeric portions of the IDs, and I put the 'Antisense' strings on the
same lines as the 'probe' lines. (I assume it's this way in the original
file and the current look is due to MUA wrapping). I hope this helps.


use strict;
use warnings;

# This data should come from HG_U95Av2_probe_fasta.txt.
my $probe = q{
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
};

# @ids should come from ID.txt.
my @ids = (1102);

my $fh;
open $fh, '<', \$probe or die("Couldn't open memory file\n");

while ($_ = <$fh>) {
     next unless m/^>probe:[^:]+:(\d+)/;
     my $id = $1;
     if (grep $id eq $_, @ids) {
         print;
         $_ = <$fh>;
         print;
     }
}

close $fh;




--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>


--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.9.2/373 - Release Date: 6/22/2006


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>


Reply via email to