Liyang,
Probably a good idea to check if the contigs in the header of *all* your
BAM files, matched the reference contigs you provided for mpileup.
I suspect if the BAMs have different headers, it could be an issue.
Also, are there a fixed set of "bad" contigs among all BAMs, or in
different BAMs the "bad" contigs might be different?
when the -I option failed, does it only include "good" contigs you tested
with -r?
hope this helps,
Shuoguo
==
SW
On Fri, Mar 24, 2017 at 6:48 AM, Thomas W. Blackwell <tbla...@umich.edu>
wrote:
> Liyang -
>
> First, you might try samtools version 1.4 from github or
> sourceforge ... but I suspect the problem will remain.
>
> I suspect that you are wanting to call variants across multiple
> .bam files, using the -b option of samtools mpileup. If that's
> true, then by far the easiest option is to fix every .bam file
> using samtools reheader.
>
> If the list of ~1M sequence contigs is the same in every .bam file,
> then extract a copy of the header from one of the .bam files using
> samtools view -H file.bam > tmp.header,
> use a text editor to change the bad contig names in file tmp.header,
> make a separate directory for the new .bam files and then use a
> foreach loop something like (using csh syntax):
>
> cd ../new.dir
> foreach file ( ../old.dir/*.bam )
> samtools reheader -P tmp.header $file > `basename $file`
> end
>
> This simple version loses any header information that changes between
> one .bam file and the next (such as sample and library names), but for
> a quick look that may be good enough. There's a more complicated version
> in which the file tmp.header contains only the header lines that stay the
> same for every .bam file and you append the variable information before
> each call to samtools reheader. (samtools reheader will read from a pipe.)
>
> - tom blackwell -
>
> On Thu, 23 Mar 2017, Liyang Diao wrote:
>
> > Dear all,
> >
> > I have a large number of bam files, where the number of reference
> "genomes"
> > is very large, about 1M (bacterial marker gene alignments). A small
> > fraction of these genomes is poorly named, resulting in the following
> error
> > when I run mpileup:
> >
> > Could not parse the header line: ##contig=<ID=BADNAMES>"
> >
> > Since this was a small fraction of the references and I am only
> interested
> > in a preliminary exploratory analysis, I went ahead and looked into the
> > VCFs that were generated, assuming (wrongly?) that alignments to these
> > areas would simply be ignored.
> >
> > What I found, however, was that the variants called are incorrect--for
> > example, I have high-confidence SNPs found in regions of zero coverage.
> >
> > So I am wondering if there is an easy workaround to this problem, or if I
> > will have to perform realignments of the data, removing or renaming the
> > culprit references.
> >
> > I found that, for some reason, using the -r POSITION flag in mpileup
> > appears to give reasonable results, but that -l produces bad results as
> > before, but through searching this help archive I found that -r cannot
> > accept multiple positions in file format.
> >
> > Any help would be greatly appreciated!
> > Thanks
> >
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help