Hi Peter,
First thank you so much for the quick reply! I'm only interested in single
positions, so I think you're right that fetching those positions is not the
best way. Right now I'm getting a whole chromosome, and then parsing
through that to find the positions I'm interested in. That seems to be
slightly better, but it still just takes so long to get all that data.
Maybe I should try something between single position and entire chromosome.
My bam files are 70-80 GBs and they are high coverage (40X). I'm expecting
between a day or two to do one whole genome, so if I have a lot of genomes
it would take months. I guess I'm hoping there's some faster way to read
pileup information from these bam files, maybe by sacrificing something
mplileup provides, but that's probably a long shot...
Thanks again for the response!
-Alissa
On Tue, Jan 27, 2015 at 1:36 AM, Peter Cock <p.j.a.c...@googlemail.com>
wrote:
> Hi Alissa,
>
> How big (base pair length) are the pileup regions you are using?
> I would guess too small would mean much more overhead fetching
> chunks of data, while too large would mean heavier RAM needs.
>
> Whey you say "most positions in the genomes", if that really is
> the majority of the bases, it might be faster not to bother with
> the indexing API, but simply iterate over all the reads in the BAM
> file in the order on disk. You will get some reads covering the
> (small fraction of) positions you are not interested in, but the
> disk IO should be mush more efficient (no seeking).
>
> I don't have any numbers for you - I'm sure a lot would depend
> on the genome size, and coverage depth.
>
> Regards,
>
> Peter
>
> On Tue, Jan 27, 2015 at 12:41 AM, Alissa Severson
> <aseve...@systemsbiology.org> wrote:
> > Hi all,
> >
> > I have a set of many (hundreds) of genomes which I'm hoping to analyze.
> For
> > most positions in the genomes, I want to count the number of A's, G's,
> C's,
> > T's. So far I've been doing this by generating the pileup and then
> parsing
> > the data at the positions I'm interested in. The problem is this process
> is
> > taking a ton of time. Just for one individual on chr20, my entire script
> > took 36 minutes, 34 of which were spent generating the pileup. If I want
> to
> > run this on the entire genome for all the samples it will simply take too
> > long to be reasonable.
> >
> > I'm hoping for some advice or ideas one how to improve this. I've tried
> > using -r and -l together, and it actually slightly slows samtools down
> > versus generating the entire chromosome and filtering for the positions I
> > need. My script is written in java, so I was thinking of trying to
> > write/alter a GATK walker which I could call directly from my script,
> but I
> > don't have any experience with walkers so I'm not thrilled about that
> > option. I've also done a bit of looking around and found piledriver,
> part of
> > bamtools, but haven't tried to use it yet. Does anyone have a sense of
> how
> > their performance compares?
> >
> > Although I don't really know what I'm talking about, it seems that if I
> have
> > a sorted list of the positions I'm interested in and just need to get
> counts
> > for each base at those positions, that it shouldn't be so computationally
> > demanding.
> >
> > Thanks for any and all the help/advice! I really appreciate it!
> > -Alissa
> >
> >
> ------------------------------------------------------------------------------
> > Dive into the World of Parallel Programming. The Go Parallel Website,
> > sponsored by Intel and developed in partnership with Slashdot Media, is
> your
> > hub for all things parallel software development, from weekly thought
> > leadership blogs to news, videos, case studies, tutorials and more. Take
> a
> > look and join the conversation now. http://goparallel.sourceforge.net/
> > _______________________________________________
> > Samtools-help mailing list
> > Samtools-help@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/samtools-help
> >
>
------------------------------------------------------------------------------
Dive into the World of Parallel Programming. The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help