Hi,
I accidentally concatenated the output of several gunzip -c runs into one
FASTQ file, aligned it with STAR, and produced SAM output.
My question is: is the following Perl one-liner sufficient to remove the
redundant records from the SAM file?
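perl -ne '
  use Digest::MD5 qw(md5);
  # Header lines (starting with @) pass through unchanged.
  if ( /^@/ ) { print }
  else {
    # Print an alignment line only the first time its MD5 digest is seen.
    print unless $x{md5($_)}++
  }
' "$sam" > "$sam.trim"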
I'm asking because I'm seeing huge reductions in file size, for example
from 187G to 27G, or from 15G to 0.1G.
As far as I can tell this is fine, and I don't expect a significant number
of hash collisions; I'm just wondering whether I have overlooked something.
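For a rough sense of scale on the collision question, here is a sketch of the
birthday (union) bound for a 128-bit digest. It assumes the MD5 digests of
distinct lines behave like uniform random values, and n = 10^9 is only an
illustrative line count, not a figure measured from my files.

```latex
% Union (birthday) bound for n distinct lines hashed to 128 bits (MD5),
% assuming the digests behave like independent uniform random values.
\[
  P(\text{collision}) \;\le\; \binom{n}{2}\, 2^{-128}
  \;=\; \frac{n(n-1)}{2^{129}}
\]
% Illustrative (assumed) n = 10^9 distinct alignment lines:
\[
  \frac{(10^{9})^{2}}{2^{129}} \approx \frac{10^{18}}{6.8 \times 10^{38}}
  \approx 1.5 \times 10^{-21}
\]
```

A collision would make the one-liner silently drop a unique line, so under
that assumption this is the chance of losing any line at all.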
Cheers,
Dan.
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help