(As I noted in my reply to Christian, I believe my problem was caused  
by an issue with file permissions expiring, and I think I've resolved  
it).

Marcel Martin wrote:
> Piping is extremely powerful: It avoids creating huge temporary files
> and even gives you some automatic parallelization. Use it! :)

Yep.  I did create the intermediate files Christian suggested, albeit  
on much smaller files, while validating my process.  On the full files  
it's not practical.  In addition to the disk space used, the extra I/O  
added would greatly increase the run time.  And increased network  
bandwidth to the disks could impact other programs running concurrently.

> Bob, I have no good idea regarding the original problem, but I have a
> few comments. If you want to create a truncated BAM file, you can do it
> like this:

Yesterday, contemplating a workaround, I did resurrect some code I
wrote a few years ago that pulls segments out of BAM files.  I'd
forgotten that I had written it (I haven't had many occasions to use  
it).  A pre-processing step scans the entire BAM file and builds an  
index of the in-bam location of every Nth record (in my case I used  
N=200K). "Locations" are equivalent to BGZF virtual addresses as  
described at the end of section 3.1 in the SAM spec (the section  
describing BAM).
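
For reference, a minimal Python sketch of how such index entries can be packed, assuming the layout given in the SAM spec (compressed block offset in the upper 48 bits, offset within the uncompressed block in the lower 16). The function names are hypothetical and this is not the code from my indexer:

```python
def make_virtual_offset(coffset: int, uoffset: int) -> int:
    """Pack a BGZF virtual offset.

    coffset: byte offset of the start of the BGZF block in the file.
    uoffset: byte offset inside that block's uncompressed data.
    """
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uoffset must fit in 16 bits")
    if not 0 <= coffset < (1 << 48):
        raise ValueError("coffset must fit in 48 bits")
    return (coffset << 16) | uoffset


def split_virtual_offset(voffset: int) -> tuple:
    """Unpack a virtual offset into (coffset, uoffset)."""
    return voffset >> 16, voffset & 0xFFFF


def sample_every_nth(record_voffsets, n: int) -> list:
    """Keep the virtual offset of records 0, n, 2n, ... (the index entries)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [v for i, v in enumerate(record_voffsets) if i % n == 0]
```

With N=200K, the index for a file of R records holds roughly R/200000 entries, so it stays small even for very large files.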

An extractor program can then extract any of these N-record blocks by
reading the index, seeking into the BAM file, reading BAM records and
writing them to BAM output.  The extractor precedes those blocks by
copying the header block, and then appends an EOF block.  So the
output looks like a real BAM file/stream.
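
A rough Python sketch of the output layout, assuming the header and the record data have already been written as complete BGZF blocks. The 28-byte EOF marker is the empty BGZF block listed in the SAM spec; wrap_segment is a hypothetical name, not my extractor's actual code:

```python
import gzip

# The empty BGZF block that marks end-of-file (28 bytes, per the SAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"   # gzip header with FEXTRA set, XLEN=6
    "42430200" "1b00"                     # 'B','C' subfield, BSIZE = 27
    "0300"                                # empty deflate stream
    "00000000" "00000000"                 # CRC32 and ISIZE of empty payload
)


def wrap_segment(header_blocks: bytes, record_blocks: bytes) -> bytes:
    """Return header + records + EOF so the result reads as a whole BAM stream.

    Both arguments are assumed to be sequences of complete BGZF blocks.
    """
    return header_blocks + record_blocks + BGZF_EOF


# The EOF marker is itself a valid gzip member that decompresses to nothing.
assert gzip.decompress(BGZF_EOF) == b""
```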

My workaround was then going to be to run many jobs, each suckling on
a different segment of the BAM file, and do a merge afterwards.  This
way, if I got failures during this filtering stage, at least they'd
happen sooner, and re-running individual segments wouldn't take as
long (and the segments could themselves be subdivided).
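
As an illustration only, here is a small Python sketch of that "one job per segment, collect the failures for a re-run" pattern. It uses local threads as a stand-in for whatever job scheduler is actually available, and process_segment is a hypothetical callable supplied by the caller:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_segments(segment_ids, process_segment, max_workers=4):
    """Run process_segment(seg) for each segment id.

    Returns (results, failed): results maps segment id to the return value,
    failed is a sorted list of segment ids whose job raised an exception.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_segment, seg): seg for seg in segment_ids}
        for fut in as_completed(futures):
            seg = futures[fut]
            try:
                results[seg] = fut.result()
            except Exception:
                # Remember the segment so only it needs to be re-run.
                failed.append(seg)
    return results, sorted(failed)
```

Only the ids in failed need another pass, and each of those could be split into smaller segments before retrying.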

> Interestingly, at least in my samtools version (0.1.19-96b5f2294a,
> Ubuntu 14.04), the EOF marker check is also done (and fails) when you
> pipe in a BAM file from stdin:
>
> $ samtools view -b merged.bam | samtools view - > /dev/null
> [bam_header_read] EOF marker is absent. The input is probably truncated.

Yeah, I hear ya.  While I was working on my workaround yesterday, I  
discovered that same misleading warning.  bgzf_check_EOF tries to seek  
to the end-of-file record, fails (because it's a pipe), and reports  
this to bam_header_read in a way that is indistinguishable from the  
last part of the file not being an end-of-file record.
bam_header_read looks like it has some code intended to suppress the
warning message for pipes, but on my machine that's not working.

Since I was piping the output of my BAM extractor into samtools view,  
I got that message and thought it meant the EOF my extractor was  
emitting was somehow flawed.  It wasn't.

An easy fix would be for bgzf_check_EOF to report a different failure  
code for "I can't seek".  At least it seems like an easy fix.  But I  
don't know the code well enough to be certain.
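
To make the idea concrete, here is a Python sketch of a three-way result (this is not the samtools C code, and EofStatus and check_eof are hypothetical names). The point is only that "can't seek" gets its own return value instead of being folded into "marker absent":

```python
import enum

# The empty BGZF block that marks end-of-file (28 bytes, per the SAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"
    "42430200" "1b00" "0300"
    "00000000" "00000000"
)


class EofStatus(enum.Enum):
    PRESENT = "present"        # stream ends with the EOF block
    ABSENT = "absent"          # seekable, but the tail is not the EOF block
    UNSEEKABLE = "unseekable"  # pipe or similar: cannot tell either way


def check_eof(stream) -> EofStatus:
    """Check a binary stream for the BGZF EOF marker without consuming it."""
    if not stream.seekable():
        return EofStatus.UNSEEKABLE
    start = stream.tell()
    try:
        size = stream.seek(0, 2)  # seek to end; returns the stream size
        if size < len(BGZF_EOF):
            return EofStatus.ABSENT
        stream.seek(size - len(BGZF_EOF))
        tail = stream.read(len(BGZF_EOF))
    finally:
        stream.seek(start)  # restore the caller's read position
    return EofStatus.PRESENT if tail == BGZF_EOF else EofStatus.ABSENT
```

A caller can then warn about truncation only on ABSENT and stay quiet on UNSEEKABLE.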

Thanks for your ideas and info,
Bob H


_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
