Hi Alissa,

How big (base pair length) are the pileup regions you are using?
I would guess too small would mean much more overhead fetching
chunks of data, while too large would mean heavier RAM needs.

Whey you say "most positions in the genomes", if that really is
the majority of the bases, it might be faster not to bother with
the indexing API, but simply iterate over all the reads in the BAM
file in the order on disk. You will get some reads covering the
(small fraction of) positions you are not interested in, but the
disk IO should be mush more efficient (no seeking).

I don't have any numbers for you - I'm sure a lot would depend
on the genome size, and coverage depth.

Regards,

Peter

On Tue, Jan 27, 2015 at 12:41 AM, Alissa Severson
<aseve...@systemsbiology.org> wrote:
> Hi all,
>
> I have a set of many (hundreds) of genomes which I'm hoping to analyze. For
> most positions in the genomes, I want to count the number of A's, G's, C's,
> T's. So far I've been doing this by generating the pileup and then parsing
> the data at the positions I'm interested in. The problem is this process is
> taking a ton of time. Just for one individual on chr20, my entire script
> took 36 minutes, 34 of which were spent generating the pileup. If I want to
> run this on the entire genome for all the samples it will simply take too
> long to be reasonable.
>
> I'm hoping for some advice or ideas one how to improve this. I've tried
> using -r and -l together, and it actually slightly slows samtools down
> versus generating the entire chromosome and filtering for the positions I
> need. My script is written in java, so I was thinking of trying to
> write/alter a GATK walker which I could call directly from my script, but I
> don't have any experience with walkers so I'm not thrilled about that
> option. I've also done a bit of looking around and found piledriver, part of
> bamtools, but haven't tried to use it yet. Does anyone have a sense of how
> their performance compares?
>
> Although I don't really know what I'm talking about, it seems that if I have
> a sorted list of the positions I'm interested in and just need to get counts
> for each base at those positions, that it shouldn't be so computationally
> demanding.
>
> Thanks for any and all the help/advice! I really appreciate it!
> -Alissa
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming. The Go Parallel Website,
> sponsored by Intel and developed in partnership with Slashdot Media, is your
> hub for all things parallel software development, from weekly thought
> leadership blogs to news, videos, case studies, tutorials and more. Take a
> look and join the conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help
>

------------------------------------------------------------------------------
Dive into the World of Parallel Programming. The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to