Re: [Samtools-help] bams with malformed region names lead to variant calls using -l to be incorrect

Thomas W. Blackwell Fri, 24 Mar 2017 04:49:56 -0700

Liyang  -

First, you might try samtools version 1.4 from github or 
sourceforge ... but I suspect the problem will remain.


I suspect that you are wanting to call variants across multiple 
.bam files, using the -b option of samtools mpileup.  If that's 
true, then by far the easiest option is to fix every .bam file 
using samtools reheader.

If the list of ~1M sequence contigs is the same in every .bam file, 
then extract a copy of the header from one of the .bam files using
          samtools view -H file.bam > tmp.header, 
use a text editor to change the bad contig names in file tmp.header, 
make a separate directory for the new .bam files and then use a 
foreach loop something like (using csh syntax):

     cd ../new.dir
     foreach file ( ../old.dir/*.bam )
     samtools reheader -P tmp.header $file > `basename $file`
     end

This simple version loses any header information that changes between 
one .bam file and the next (such as sample and library names), but for 
a quick look that may be good enough.  There's a more complicated version 
in which the file tmp.header contains only the header lines that stay the 
same for every .bam file and you append the variable information before 
each call to samtools reheader.  (samtools reheader will read from a pipe.)

                                                -  tom blackwell  -

On Thu, 23 Mar 2017, Liyang Diao wrote:

> Dear all,
>
> I have a large number of bam files, where the number of reference "genomes"
> is very large, about 1M (bacterial marker gene alignments). A small
> fraction of these genomes is poorly named, resulting in the following error
> when I run mpileup:
>
> Could not parse the header line: ##contig=<ID=BADNAMES>"
>
> Since this was a small fraction of the references and I am only interested
> in a preliminary exploratory analysis, I went ahead and looked into the
> VCFs that were generated, assuming (wrongly?) that alignments to these
> areas would simply be ignored.
>
> What I found, however, was that the variants called are incorrect--for
> example, I have high-confidence SNPs found in regions of zero coverage.
>
> So I am wondering if there is an easy workaround to this problem, or if I
> will have to perform realignments of the data, removing or renaming the
> culprit references.
>
> I found that, for some reason, using the -r POSITION flag in mpileup
> appears to give reasonable results, but that -l produces bad results as
> before, but through searching this help archive I found that -r cannot
> accept multiple positions in file format.
>
> Any help would be greatly appreciated!
> Thanks
>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Re: [Samtools-help] bams with malformed region names lead to variant calls using -l to be incorrect

Reply via email to