Re: [Bioc-devel] restrictToSNV for VCF

Hervé Pagès Fri, 21 Mar 2014 15:37:03 -0700

Hi Martin,

On 03/21/2014 01:45 PM, Martin Morgan wrote:

On 03/20/2014 05:20 PM, Hervé Pagès wrote:

Hi,


On 03/19/2014 01:10 PM, Michael Lawrence wrote:

You can apparently use 1D extraction for VCF, which is a little
surprising;
I learned it from restrictToSNV.


This is inherited from SummarizedExperiment:

   > example(SummarizedExperiment)

   > se1
   class: SummarizedExperiment
   dim: 200 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

   > se1[1:4]
   class: SummarizedExperiment
   dim: 4 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

To me that means that a SummarizedExperiment has a length
(conceptually), and that this length is the number of rows.
It would actually help if a "length" method was defined:

   > length(se1)
   [1] 1


I think of a SummarizedExperiment as fundamentally a matrix with row and
column annotations. 'length' would then be prod(dim(se1))


But it's not defined as such either.

Note that findOverlaps() on SummarizedExperiment objects returns a
Hits object with indices in the 1:nrow(query) or 1:nrow(subject)
range. I'd like to be able to say "in the seq_along(query) or
seq_along(subject) range" because that's what findOverlaps() does
on any other object defined in IRanges/GenomicRanges/GenomicAlignments.
But I can't because that would be inaccurate.

However, it's conceptually true: I can use the indices in the Hits
object to do 1D extractions from the query or subject. This is good
and consistent with any other type of query or subject.

col- and
rownames() are defined but names() is NULL. I guess 1-D sub-setting
isn't matrix-like, but I don't think that removing this 'feature' simply
for consistency sake is worth it;


I was not suggesting that.

I guess the subsetting logical was
copy/pasted from other code without enough thought. head(), tail() could
be implemented if this were somehow useful (I usually use these for
compact display, and that's irrelevant here...);


I still find it sometimes useful to be able to do head() on a big
object when I just want to try things on a few elements first:

  > dim(vcf)
  [1] 1000000       3

  toy <- head(vcf)
  rowData(toy)
  assay(toy)
  isSNV(toy)
  findOverlaps(toy, exons)

It's more convenient and much quicker than having to truncate the
individual results:

  head(rowData(vcf))
  head(assay(vcf))
  head(isSNV(vcf))
  head(findOverlaps(vcf, exons))

I guess what I'm trying to say is that while it helps thinking of
a SummarizedExperiment as fundamentally a matrix, there are already
enough differences with the matrix API to suggest that, unlike for a
matrix, the length of a SummarizedExperiment object is its nb of rows.
It's implicit in many ways and I think that formalizing it would help
rather than hurt. It will still be somewhat a surprise for the
end-user, but not a bigger surprise than the ones s/he gets right
now with seq_along(vcf), vcf[i], isSNV(vcf), findOverlaps(), head(),
etc.. And once s/he gets over it, there won't be anymore surprises:
all these things will be in agreement with length(vcf) and behave
as expected.

Thanks,
H.

rev() on a matrix
doesn't do anything useful.

Martin


That would automatically fix many convenience [ wrappers like head(),
tail(), rev(), etc...

   > head(se1)
   class: SummarizedExperiment
   dim: 1 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

   > rev(se1)
   class: SummarizedExperiment
   dim: 1 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

Following that logic names(se1) also probably return colnames(se1).

H.





On Wed, Mar 19, 2014 at 1:07 PM, Vincent Carey
<st...@channing.harvard.edu>wrote:




On Wed, Mar 19, 2014 at 4:00 PM, Michael Lawrence <
lawrence.mich...@gene.com> wrote:

It would be nice to have functions like isSNV, isIndel, isDeletion,
etc
that at least provide precise definitions of the terminology. I've
added
these, but they're designed only for VRanges. Should work for
ExpandedVCF.

Also, it would be nice if restrictToSNV just assumed that alt(x)
must be
something with nchar() support (with special handling for any
List), so
that the 'character' vector of alt,VRanges would work immediately.
Basically restrictToSNV should just be x[isSNV(x)]. Is there even a
use-case for the restrictToSNV abstraction if we did that?

for VCF instance it would be x[isSNV(x),] and indeed I think that
would be
sufficient.  i like the idea of having this family of predicates for
variant classes to allow such selections

Michael



On Tue, Mar 18, 2014 at 10:36 AM, Valerie Obenchain
<voben...@fhcrc.org>wrote:

Hi,

I've added a restrictToSNV() function to VariantAnnotation
(1.9.46). The
return value is a subset VCF object containing SNVs only. The
function
operates on CollapsedVCF or ExapandedVCF and the alt(VCF) value
must be
nucleotides (i.e., no structural variants).

A variant is considered a SNV if the nucleotide sequences in both
ref(vcf) and alt(x) are of length 1. I have a question about how
variants
with multiple 'ALT' values should be handled.

Should we consider row 4 a SNV? One 'ALT' is length 1, the other
is not.

ALT <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
REF <- DNAStringSet(c("G", c("AA"), "T", "G"))

DataFrame(REF, ALT)

DataFrame with 4 rows and 2 columns
              REF                ALT
   <DNAStringSet> <DNAStringSetList>
1              G                  A
2             AA                 TT
3              T                G,A
4              G               TT,C



Thanks.
Valerie

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


    [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] restrictToSNV for VCF

Reply via email to