(As I noted in my reply to Christian, I believe my problem was caused by file permissions expiring, and I think I've resolved it.)

Marcel Martin wrote:
> Piping is extremely powerful: It avoids creating huge temporary files
> and even gives you some automatic parallelization. Use it! :)

Yep. I did create the intermediate files Christian suggested, albeit on much smaller files, while validating my process. On the full files it's not practical. In addition to the disk space used, the extra I/O would greatly increase the run time. And the increased network bandwidth to the disks could impact other programs running concurrently.

> Bob, I have no good idea regarding the original problem, but I have a
> few comments. If you want to create a truncated BAM file, you can do
> it like this:

Yesterday, contemplating a workaround, I did resurrect some code I wrote a few years ago that pulls segments out of BAM files. I'd forgotten that I had written it (I haven't had many occasions to use it).

A pre-processing step scans the entire BAM file and builds an index of the in-BAM location of every Nth record (in my case I used N=200K). "Locations" are equivalent to BGZF virtual addresses as described at the end of section 3.1 in the SAM spec (the section describing BAM). An extractor program can then extract any of these N-record blocks by reading the index, seeking into the BAM file, reading BAM records and writing them to BAM output. The extractor precedes those blocks by copying the header block, and then appends an EOF block. So the output looks like a real BAM file/stream.

My workaround was then going to be to run many jobs, each suckling on a different segment of the BAM file, and do a merge afterwards. This way, if I got failures during this filtering stage, at least they'd happen sooner, and re-running individual segments wouldn't take as long (and the segments could themselves be subdivided).

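
For readers unfamiliar with BGZF virtual addresses, here is a minimal Python sketch of the arithmetic such an index relies on. It is not the extractor described above. It assumes only the layout given in the SAM spec, where a virtual address packs the compressed-file offset of a BGZF block (coffset) into the upper 48 bits and the offset within that block's uncompressed data (uoffset) into the lower 16 bits. All function names are hypothetical, and the list of per-record addresses is assumed to come from a separate scan of the BAM file.

```python
def make_voffset(coffset, uoffset):
    """Pack a BGZF virtual address: (coffset << 16) | uoffset."""
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uoffset must fit in 16 bits")
    if not 0 <= coffset < (1 << 48):
        raise ValueError("coffset must fit in 48 bits")
    return (coffset << 16) | uoffset


def split_voffset(voffset):
    """Unpack a virtual address into (coffset, uoffset)."""
    return voffset >> 16, voffset & 0xFFFF


def every_nth(record_voffsets, n):
    """Keep the virtual address of records 0, n, 2n, ... (hypothetical index)."""
    return [v for i, v in enumerate(record_voffsets) if i % n == 0]


def segment_bounds(index, k):
    """Return (start, end) virtual addresses of segment k.

    end is None for the last segment, meaning "read to the end of the file".
    """
    start = index[k]
    end = index[k + 1] if k + 1 < len(index) else None
    return start, end
```

An extractor along these lines would seek to the compressed offset of the start address, skip uoffset bytes of decompressed data, and copy records until it reaches the end address.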
> Interestingly, at least in my samtools version (0.1.19-96b5f2294a,
> Ubuntu 14.04), the EOF marker check is also done (and fails) when you
> pipe in a BAM file from stdin:
>
> $ samtools view -b merged.bam | samtools view - > /dev/null
> [bam_header_read] EOF marker is absent. The input is probably
> truncated.

Yeah, I hear ya. While I was working on my workaround yesterday, I discovered that same misleading warning. bgzf_check_EOF tries to seek to the end-of-file record, fails (because it's a pipe), and reports this to bam_header_read in a way that is indistinguishable from the last part of the file not being an end-of-file record. bam_header_read looks like it has some code intended to suppress the warning message for pipes, but on my machine that's not working.

Since I was piping the output of my BAM extractor into samtools view, I got that message and thought it meant the EOF my extractor was emitting was somehow flawed. It wasn't.

An easy fix would be for bgzf_check_EOF to report a different failure code for "I can't seek". At least it seems like an easy fix. But I don't know the code well enough to be certain.

Thanks for your ideas and info,
Bob H

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help