Sorry, I forgot to copy the help listserve to note that Thomas's suggestion
did indeed solve my problem, text below, along with Thomas's response:

~~~~~~~~~~~~~~~~~~~
Dear Thomas,

Thank you for your advice! I was not aware of the reheader command. I was
able to correct the headers and at first glance at least I am no longer
seeing SNPs called in known zero-coverage regions.

If you wouldn't mind, out of curiosity, does reheader actually only replace
the header of the bam file? None of the "bad" regions had any reads aligned
to them, but if they did, would they be referred to in output by the
corrected names or the original names?

Thanks again for your help! It saved me a lot of time.
Best,
Liyang
~~~~~~~~~~~~~~~~~~~~~~
Liyang  -

Glad this helped.

Yes, samtools reheader only replaces the header text.  The binary data
content of the .bam file is unchanged.  In .bam format, each reference
sequence contig is identified by an integer (perhaps zero based).  When
converting to ascii sam format (or in mpileup) the string value shown for
the ith reference sequence contig is the "SN:" value from the ith "@SQ"
line from the header.  So, when replacing the header, it's essential to
keep the "@SQ" lines in the same order.  Otherwise, the association between
integers and strings gets mixed up.  (These are all details that I didn't
want to worry you with before.)

I suspect that with the old .bam files, the header lines with malformed
values were simply ignored, so the string values chosen for later reference
sequence contigs would be off by the number of preceding malformed header
lines.  So the variant calls were correct -- they simply weren't given the
correct reference sequence names.  But that's just a guess.  I haven't
tested whether it's correct.

On Fri, Mar 24, 2017 at 7:48 AM, Thomas W. Blackwell <tbla...@umich.edu>
wrote:

> Liyang  -
>
> First, you might try samtools version 1.4 from github or sourceforge ...
> but I suspect the problem will remain.
>
> I suspect that you are wanting to call variants across multiple .bam
> files, using the -b option of samtools mpileup.  If that's true, then by
> far the easiest option is to fix every .bam file using samtools reheader.
>
> If the list of ~1M sequence contigs is the same in every .bam file, then
> extract a copy of the header from one of the .bam files using
>          samtools view -H file.bam > tmp.header, use a text editor to
> change the bad contig names in file tmp.header, make a separate directory
> for the new .bam files and then use a foreach loop something like (using
> csh syntax):
>
>     cd ../new.dir
>     foreach file ( ../old.dir/*.bam )
>     samtools reheader -P tmp.header $file > `basename $file`
>     end
>
> This simple version loses any header information that changes between one
> .bam file and the next (such as sample and library names), but for a quick
> look that may be good enough.  There's a more complicated version in which
> the file tmp.header contains only the header lines that stay the same for
> every .bam file and you append the variable information before each call to
> samtools reheader.  (samtools reheader will read from a pipe.)
>
>                                                 -  tom blackwell  -
>
>
> On Thu, 23 Mar 2017, Liyang Diao wrote:
>
> Dear all,
>>
>> I have a large number of bam files, where the number of reference
>> "genomes"
>> is very large, about 1M (bacterial marker gene alignments). A small
>> fraction of these genomes is poorly named, resulting in the following
>> error
>> when I run mpileup:
>>
>> Could not parse the header line: ##contig=<ID=BADNAMES>"
>>
>> Since this was a small fraction of the references and I am only interested
>> in a preliminary exploratory analysis, I went ahead and looked into the
>> VCFs that were generated, assuming (wrongly?) that alignments to these
>> areas would simply be ignored.
>>
>> What I found, however, was that the variants called are incorrect--for
>> example, I have high-confidence SNPs found in regions of zero coverage.
>>
>> So I am wondering if there is an easy workaround to this problem, or if I
>> will have to perform realignments of the data, removing or renaming the
>> culprit references.
>>
>> I found that, for some reason, using the -r POSITION flag in mpileup
>> appears to give reasonable results, but that -l produces bad results as
>> before, but through searching this help archive I found that -r cannot
>> accept multiple positions in file format.
>>
>> Any help would be greatly appreciated!
>> Thanks
>>
>>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to