Re: [Bioc-devel] BamTallyParam argument 'which'

Hervé Pagès Wed, 25 Feb 2015 13:37:25 -0800

Thanks Leonardo. I'll add this to the man page. This will definitely
help clarify the "duplicated selection" issue.


H.

On 02/24/2015 07:40 PM, Leonardo Collado Torres wrote:

Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
     seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)








Here's the output:

library('GenomicAlignments')

## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",

+                        mustWork=TRUE)

which <- RangesList(seq1=IRanges(1000, 2000),

+                     seq2=IRanges(c(100, 1000), c(1000, 2000)))

param <- ScanBamParam(which=which)
gal4 <- readGAlignments(bamfile, param=param)
gal4

GAlignments object with 2404 alignments and 0 metadata columns:
          seqnames strand       cigar    qwidth     start       end
width     njunc
             <Rle>  <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
      [1]     seq1      +         35M        35       970      1004
    35         0
      [2]     seq1      +         35M        35       971      1005
    35         0
      [3]     seq1      +         35M        35       972      1006
    35         0
      [4]     seq1      +         35M        35       973      1007
    35         0
      [5]     seq1      +         35M        35       974      1008
    35         0
      ...      ...    ...         ...       ...       ...       ...
   ...       ...
   [2400]     seq2      +         35M        35      1524      1558
    35         0
   [2401]     seq2      +         35M        35      1524      1558
    35         0
   [2402]     seq2      -         35M        35      1528      1562
    35         0
   [2403]     seq2      -         35M        35      1532      1566
    35         0
   [2404]     seq2      -         35M        35      1533      1567
    35         0
   -------
   seqinfo: 2 sequences from an unspecified genome


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),

+     seq2=IRanges(c(100, 1000), c(1000, 2000)))

param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


length(gal4)

[1] 2404


## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

[1] 3014


## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)

[1] 2343






options(width = 120)
devtools::session_info()

Session 
info-----------------------------------------------------------------------------------------------------------
  setting  value
  version  R Under development (unstable) (2014-11-01 r66923)
  system   x86_64, darwin10.8.0
  ui       AQUA
  language (EN)
  collate  en_US.UTF-8
  tz       America/New_York

Packages---------------------------------------------------------------------------------------------------------------
  package           * version date       source
  base64enc           0.1.2   2014-06-26 CRAN (R 3.2.0)
  BatchJobs           1.4     2014-09-24 CRAN (R 3.2.0)
  BBmisc              1.7     2014-06-21 CRAN (R 3.2.0)
  BiocGenerics      * 0.13.4  2014-12-31 Bioconductor
  BiocParallel        1.1.13  2015-01-27 Bioconductor
  Biostrings        * 2.35.8  2015-02-14 Bioconductor
  bitops              1.0.6   2013-08-17 CRAN (R 3.2.0)
  brew                1.0.6   2011-04-13 CRAN (R 3.2.0)
  checkmate           1.5.0   2014-10-19 CRAN (R 3.2.0)
  codetools           0.2.9   2014-08-21 CRAN (R 3.2.0)
  DBI                 0.3.1   2014-09-24 CRAN (R 3.2.0)
  devtools            1.6.1   2014-10-07 CRAN (R 3.2.0)
  digest              0.6.4   2013-12-03 CRAN (R 3.2.0)
  fail                1.2     2013-09-19 CRAN (R 3.2.0)
  foreach             1.4.2   2014-04-11 CRAN (R 3.2.0)
  GenomeInfoDb      * 1.3.13  2015-02-13 Bioconductor
  GenomicAlignments * 1.3.27  2015-01-26 Bioconductor
  GenomicRanges     * 1.19.37 2015-02-13 Bioconductor
  IRanges           * 2.1.38  2015-02-08 Bioconductor
  iterators           1.0.7   2014-04-11 CRAN (R 3.2.0)
  Rsamtools         * 1.19.27 2015-02-07 Bioconductor
  RSQLite             1.0.0   2014-10-25 CRAN (R 3.2.0)
  rstudioapi          0.1     2014-03-27 CRAN (R 3.2.0)
  S4Vectors         * 0.5.20  2015-02-19 Bioconductor
  sendmailR           1.2.1   2014-09-21 CRAN (R 3.2.0)
  stringr             0.6.2   2012-12-06 CRAN (R 3.2.0)
  XVector           * 0.7.4   2015-02-08 Bioconductor
  zlibbioc            1.13.1  2015-02-11 Bioconductor






On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
<goldstein.leon...@gene.com> wrote:

Sounds very sensible not to double count in the context of tallying
variants. I was more concerned with reducing which as the default
behavior for scanBam and other functions.

I wanted to bring up the samtools behavior as - for me at least -
inconsistencies between Rsamtools and samtools have been another
source of confusion in the past (e.g. different naming conventions for
fields like isize vs TLEN etc.)

Leonard


On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

Maybe Rsamtools would want to follow this precedent. I think there might be
a difference between fishing out alignments from a SAM/BAM, and deriving a
summary (tallyVariants) from a BAM. It seems like an argument could be made
for a tally set to not contain duplicates.

On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
<goldstein.leon...@gene.com> wrote:


Hi Michael and Thomas,

I ran into the same problem in the past (i.e. when I started working
with functions like scanBam I expected them not to return the same
alignment multiple times)

One thing to consider might be that returning alignments multiple
times is consistent with the behavior of the samtools view command.
Quoting from the samtools manual:

“Important note: when multiple regions are given, some alignments may
be output multiple times if they overlap more than one of the
specified regions.”

Maybe there is an argument for keeping things consistent with
samtools? As you said, if documented properly, the user can decide
whether to reduce regions specified in which or not.

Leonard


On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

We should at leaast try to avoid surprising the user. Seems like most
people expect "which" to be a simple restriction, so I think for now I
will
just reduce the which, and if someone has a use case for separate
queries,
we can address it in the future.

On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
<sandmann.tho...@gene.com>
wrote:

Personally, I don't have a use case with "meaningful loci" worth
tracking,
so keeping it simple would work for me.

Incidentally, would it be good to deal with the 'which' parameter in a
consistent way across different methods ? I just saw this recent post
on
the mailing list in which a used got confused by duplicate counts
returned
after passing 'which' to scanBamParam:

https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html


---

Thomas Sandmann, PhD
Computational biologist

Genentech, Inc.
1 DNA Way
South San Francisco, CA 94080
USA

Phone: +1 650 225 6273
Fax: +1 650 225 5389
Email: sandmann.tho...@gene.com

"If a man will begin with certainties, he shall end in doubts; but if
he
will be content to begin with doubts he shall end in certainties." --
Sir
Francis Bacon


On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
lawrence.mich...@gene.com> wrote:

We just have to decide which is the more useful interpretation of
which
-- as a simple restriction, or as a vector of meaningful locii, which
will
be analyzed individually? I would actually favor the first one (the
same as
yours), just because it's simpler. To keep track of the query ranges,
we
would need to add a new column to the returned object, which will more
often than not just be clutter. I guess we could introduce a new
parameter,
"reduceWhich" which defaults to TRUE and reduces the which. If FALSE,
it
instead adds the column mapping back to the original which ranges.


On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
sandmann.tho...@gene.com> wrote:

Hi Michael,

ah, I see. I hadn't realized that returning the pileups separately
for
each region could be a desired feature, but that makes sense. I
agree, as
it is easy for the user to 'reduce' the ranges beforehand your first
option
(e.g. returning the ID of the range) would be more flexible.

Perhaps you would consider adding a sentence to the documentation of
'which' on BamTallyParam's help page explaining that users might want
to
'reduce' their ranges beforehand if they are only interested in a
single
tally for each base ?

Thanks a lot !
Thomas


         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] BamTallyParam argument 'which'

Reply via email to