Hello again, Thanks to everyone for their helpful suggestions. I finally got it to work, using the following script. However, it takes about 5 hours to run on a fast computer. Using grep (in bash), on the other hand, takes about 5 minutes (see below if you are interested). Thanks again!
SLOW perl script: #!/usr/bin/perl -w use strict; my $IDs = 'ID_all_X'; unless (open(IDFILE, $IDs)) { print "Could not open file $IDs!\n"; } my $probes = 'HG_U95Av2_probe_fasta'; unless (open(PROBES, $probes)) { print "Could not open file $probes!\n"; } open (OUT,'>','probe_subset.txt') or die "Can't write output: $!"; my @ID = <IDFILE>; print @ID; chomp @ID; while (my $line = <PROBES>) { foreach my $identifier (@ID) { if($line=~/^>probe:\w+:$identifier:/) { print OUT $line; print OUT scalar(<PROBES>); } } } exit; FAST bash script: #!/usr/bin/bash exec<"ID_all_X" while read line; do echo $line grep -A 1 :$line: HG_U95Av2_probe_fasta >>myresults.txt done -----Original Message----- From: Muma W. [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 14, 2006 12:20 AM To: Beginners List Subject: Re: A loop to parse a large text file--output is empty! Michael Oldham wrote: > Dear all, > > I am a Perl newbie struggling to accomplish a conceptually simple > bioinformatics task. I have a single large file containing about > 200,000 DNA probe sequences (from an Affymetrix microarray), each of > which is accompanied by a header, like so (this is in FASTA format): > >> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; > Antisense; > TGGCTCCTGCTGAGGTCCCCTTTCC >> probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense; > CTACTCTCGTGGTGCACAAGGAGTG >> probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054; > Antisense; > TGCAGGTGGCAGATCTGCAGTCCAT >> probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316; > Antisense; > GTGAAGGTTGCTGAGGCTCTGACCC > > .........etc. > > What I would like to do is extract from this file a subset of ~130,800 > probes (both the header and the sequence) and output this subset into a > new text file in the same (FASTA) format. These 130,800 probes > correspond to 8,175 probe set IDs ("1138_at" is an example of a probe > set ID in the header listed above). I have these 8,175 IDs listed in a > separate file called "ID.txt" and the 200,000 probe sequences in a file > called "HG_U95Av2_probe_fasta.txt". The script below is missing > something because the output file ("probe_subset.txt") is blank. This > is also the case if I replace the file "ID.txt" with a file consisting > of a single probe set ID (e.g. 1138_at). Does anyone know what I am > missing? I am running this script in Cygwin on Windows XP. I > appreciate any suggestions! > > ~ Mike O. > [...] You'd have to modify this to work on your data, but it's close to what you're looking for I think. For my convenience, I considered only the numeric portions of the IDs, and I put the 'Antisense' strings on the same lines as the 'probe' lines. (I assume it's this way in the original file and the current look is due to MUA wrapping). I hope this helps. use strict; use warnings; # This data should come from HG_U95Av2_probe_fasta.txt. my $probe = q{ >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense; TGGCTCCTGCTGAGGTCCCCTTTCC >probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense; CTACTCTCGTGGTGCACAAGGAGTG >probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054; Antisense; TGCAGGTGGCAGATCTGCAGTCCAT >probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316; Antisense; GTGAAGGTTGCTGAGGCTCTGACCC }; # @ids should come from ID.txt. my @ids = (1102); my $fh; open $fh, '<', \$probe or die("Couldn't open memory file\n"); while ($_ = <$fh>) { next unless m/^>probe:[^:]+:(\d+)/; my $id = $1; if (grep $id eq $_, @ids) { print; $_ = <$fh>; print; } } close $fh; -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response> -- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006 -- No virus found in this outgoing message. Checked by AVG Free Edition. Version: 7.1.394 / Virus Database: 268.9.2/373 - Release Date: 6/22/2006 -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response>