Hi,

I accidentally concatenated the output of several gunzip -c runs into one
FASTQ file, aligned it with STAR and produced SAM output.

My question is, is the following Perl 'one-liner' sufficient to remove
redundancy from the SAM file?

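perl -MDigest::MD5=md5 -ne '
  # Header lines start with "@"; always keep them.
  if ( /^@/ ) { print }
  else {
    # Print an alignment line only the first time its MD5 digest is seen.
    print unless $x{md5($_)}++
  }
' "$sam" > "$sam.trim"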
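
For comparison, here is a rough Python sketch of what I intend the one-liner
to do. It is only an illustration, not something I have run on the real
files, and the function name dedup_sam is made up for this example.

```python
import hashlib


def dedup_sam(lines):
    """Yield header lines unchanged and each distinct alignment line once.

    `lines` is any iterable of text lines, e.g. an open SAM file.
    dedup_sam is a hypothetical name used only for this sketch.
    """
    seen = set()  # MD5 digests of alignment lines already emitted
    for line in lines:
        if line.startswith("@"):
            # Header line: always pass through, as in the Perl version.
            yield line
            continue
        digest = hashlib.md5(line.encode()).digest()
        if digest not in seen:
            seen.add(digest)
            yield line
```

As in the Perl version, only the 16-byte digests are kept in memory, not the
lines themselves, and two alignment lines count as duplicates only if they
are byte-for-byte identical.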

I'm asking because I'm seeing huge reductions in file size, for example
from 187G to 27G, or from 15G to 0.1G.

As far as I can tell this is fine, and I don't expect a significant number
of hash collisions. I'm just wondering whether I have overlooked something.
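
To put a rough number on the collision risk, here is a back-of-the-envelope
sketch in Python using the usual birthday-bound estimate. It assumes MD5
digests behave like uniformly random 128-bit values on non-adversarial input.
The line count of one billion is a made-up figure for illustration, not a
count from my files, and expected_md5_collisions is a hypothetical helper
name.

```python
def expected_md5_collisions(n_lines, digest_bits=128):
    """Approximate expected number of colliding pairs among n_lines
    distinct inputs, assuming uniformly random digests (birthday bound)."""
    pairs = n_lines * (n_lines - 1) // 2  # number of distinct line pairs
    return pairs / 2 ** digest_bits


# Hypothetical example: one billion distinct alignment lines.
estimate = expected_md5_collisions(10 ** 9)
print(f"expected collisions: {estimate:.2e}")
```

Under those assumptions the expected number of accidental collisions is many
orders of magnitude below one, so hashing itself should not explain the
drop in file size.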


Cheers,
Dan.
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
