Re: [Bioc-sig-seq] Loading large BAM files

2011-07-14 Thread Martin Morgan

On 07/14/2011 08:47 AM, Ivan Gregoretti wrote:

Hello Martin,

I am currently reading in data with scanBam and making use of the
ScanBamParam function.

My BAM data looks like this:

> names(bam1)
[1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783"

> class(bam1)
[1] "list"

> class(bam1$"chr1:1-197195432")
[1] "list"

> class(bam1$"chr1:1-197195432"$qname)
[1] "character"

How do we consolidate all three lists (chr1, chr2, chr3) into a single list?

I read the Rsamtools documentation but the examples do not quite apply
to this case.


These are lists-of-lists, and the somewhat opaque way of getting
elements out is along the lines of

  unlist(lapply(bam1, "[[", "qname"))

This paradigm does not work for XStringSet elements such as "seq", for
which something along the lines of

  do.call(c, unname(lapply(bam1, "[[", "seq")))

is needed instead.
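
A minimal sketch combining the two idioms, assuming bam1 is the list
returned by scanBam() and that every range was queried with the same
'what' (the helper name collapseBam is hypothetical):

  ## Collapse a scanBam() result (one list element per range in
  ## 'which') into a single list with one element per field.
  collapseBam <- function(bam) {
      fields <- names(bam[[1]])
      names(fields) <- fields
      lapply(fields, function(f) {
          elts <- unname(lapply(bam, "[[", f))
          if (is(elts[[1]], "XStringSet"))
              do.call(c, elts)   # e.g. 'seq', 'qual': concatenate with c()
          else
              unlist(elts)       # e.g. 'qname', 'pos': unlist()
      })
  }

  ## bam1flat <- collapseBam(bam1)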

Martin






Re: [Bioc-sig-seq] Loading large BAM files

2011-07-14 Thread Ivan Gregoretti
Hello Martin,

I am currently reading in data with scanBam and making use of the
ScanBamParam function.

My BAM data looks like this:

> names(bam1)
[1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783"

> class(bam1)
[1] "list"

> class(bam1$"chr1:1-197195432")
[1] "list"

> class(bam1$"chr1:1-197195432"$qname)
[1] "character"

How do we consolidate all three lists (chr1, chr2, chr3) into a single list?

I read the Rsamtools documentation but the examples do not quite apply
to this case.

Thank you,

Ivan







Re: [Bioc-sig-seq] Loading large BAM files

2011-07-13 Thread Ivan Gregoretti
Hi Martin,

I will have to play a little with both the counter function and with
simplify2array. It's not evident to me what they are intended to do.

I can answer your question about BAM size. Our samples all come from
Illumina HiSeq now, so we are talking about no fewer than 100 million
good-quality tags per lane.

On several occasions I have run into the 2^31 - 1 vector-length limit,
and I am always thinking about ways to process large amounts of
information in chunks and in parallel. Over time, sequencers will only
output more tags, so we will run into these problems more and more
often.

The suggestion of using ScanBamParam is a good one, thank you, but at
this early stage of sample analysis I need to load the whole BAM; some
sample quality assessments are in order.

Thank you,

Ivan






Re: [Bioc-sig-seq] Loading large BAM files

2011-07-13 Thread Martin Morgan


Should also have mentioned that casting a GRanges object to RangesList 
provides the appropriate list to iterate over chromosomes, and that 
ScanBamParam will accept a which of class RangesList.
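
For example, a minimal sketch of that pattern (assuming an indexed
'sample.bam', a GRanges 'gr' with one whole-chromosome range per
seqlevel, and the multicore package; the file and object names are
hypothetical):

  library(Rsamtools)
  library(multicore)

  rl <- as(gr, "RangesList")     # one list element per chromosome
  res <- mclapply(names(rl), function(chr) {
      param <- ScanBamParam(which = rl[chr], what = c("qname", "seq"))
      scanBam("sample.bam", param = param)[[1]]
  })
  qname <- unlist(lapply(res, "[[", "qname"))   # merge the chunks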


Martin





Re: [Bioc-sig-seq] Loading large BAM files

2011-07-13 Thread Martin Morgan

On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:

Hi everybody,

As I wait for my large BAM to be read in by scanBam, I can't help but wonder:

Has anybody tried combining scanBam with multicore to load the
chromosomes in parallel?

That would require

1) merging the chunks at the end, and

2) an index on the original BAM.

Does anybody have any experience to share?


Was wondering how large and long we're talking about?

Use of ScanBamParam(what=...) can help.
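
For example (a sketch; 'sample.bam' is a placeholder, and
scanBamWhat() lists the available fields):

  param <- ScanBamParam(what = c("rname", "pos", "strand"))
  bam <- scanBam("sample.bam", param = param)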

For some tasks I'd think of a coarser granularity, e.g., in the context
of multiple BAM files, so that the data reduction (to a vector of tens
of thousands of counts) occurs on each core.


  counter = function(fl, genes) {
      aln = readGappedAlignments(fl)          # alignments from one BAM file
      strand(aln) = "*"                       # discard strand before counting
      hits = countOverlaps(aln, genes)        # number of genes each read hits
      countOverlaps(genes, aln[hits == 1])    # per-gene counts of uniquely assigned reads
  }
  simplify2array(mclapply(bamFiles, counter, genes))

One issue I understand people have is that mclapply uses 'serialize()'
to convert the return value of each function call to a raw vector. Raw
vectors have the same total length limit as any other vector (2^31 - 1
elements), and this places a limit on the size of the chunk returned by
each core. I also believe that exceeding the limit can silently corrupt
the data (i.e., a bug). This is second-hand information.
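
For reference, the limit in question is R's maximum vector length:

  .Machine$integer.max                                    # 2147483647
  identical(.Machine$integer.max, as.integer(2^31 - 1))   # TRUE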


Martin






--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



Re: [Bioc-sig-seq] Loading large BAM files

2011-07-13 Thread Ivan Gregoretti
Hi Kasper,

I was monitoring the computer's processes while scanBam was running,
and the process is clearly CPU-limited.

That's why I brought up the point.

Ivan


Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878





Re: [Bioc-sig-seq] Loading large BAM files

2011-07-13 Thread Kasper Daniel Hansen
On Wed, Jul 13, 2011 at 4:36 PM, Ivan Gregoretti  wrote:
> Hi everybody,
>
> As I wait for my large BAM to be read in by scanBam, I can't help but wonder:
>
> Has anybody tried combining scanBam with multicore to load the
> chromosomes in parallel?
>
> That would require
>
> 1) merging the chunks at the end, and
>
> 2) an index on the original BAM.
>
> Does anybody have any experience to share?

Multicore will help if your task is CPU-limited (and you can chunk it).
However, your specific task is probably I/O-limited, in which case it
is not clear that more cores would help; in fact, they might be
detrimental, depending on your filesystem and hardware.
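
A quick way to check (a sketch; 'sample.bam' is a placeholder):
compare CPU time to elapsed time for the read.

  st <- system.time(bam <- scanBam("sample.bam"))
  st   # user + sys close to elapsed suggests CPU-bound;
       # elapsed much larger suggests I/O waits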

Kasper

___
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing