Hi Florian, Marc,
On 09/18/2013 11:55 PM, Hahne, Florian wrote:
Hi Marc, Herve,
I also noticed this behaviour:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevels(txdb, force=TRUE) <- c("chr1", "ch2")
oldLevs <- seqlevels(txdb)
seqlevels(txdb, force=TRUE) <- "chr1"
seqlevels(txdb, force=TRUE) <- oldLevs
Error in .seqinfo.TranscriptDbReplace(x, new2old = new2old, force =
force,
:
The replacement value must be either a 1 to 1 replacement or a
subset of
the original set when replacing the 'seqinfo' of a TranscriptDb object
But:
restoreSeqlevels(txdb)
seqlevels(txdb, force=TRUE) <- oldLevs
This is probably intentional, but I found it to be rather confusing. If
one should be able to use the seqlevels replacement method to control
active and inactive chromosomes in the TranscriptDb object, then having
this mandatory step of restoring all chromosomes to the active state
followed by another restriction to be quite cumbersome. At least a
somewhat more useful error message would help in that case. Or couldn't
the replacement method always call restoreSeqlevels internally.
Somewhat related to this, I also found the following issue:
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> seqlevels(txdb, force=TRUE) <- "chrM"
> seqlengths(txdb)
chrM
16571
> seqlevels(txdb) <- "chr1"
> seqlengths(txdb)
chr1
16571
The 2nd call to the seqlevels() setter actually performed a
*renaming* operation (a clue for this is that I didn't have to
use force=TRUE). Renaming the seqlevels of a TranscriptDb object
is something I think we want to support, e.g. when there is the
need to use a different naming convention (like in
seqlevels(txdb) <- "M"). This is exactly the syntax that one
would use to rename the seqlevels of a GRanges object:
> gr <- GRanges("chrM", IRanges(1:2, 10))
> seqlevels(gr) <- "chr1"
> gr
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [1, 10] *
[2] chr1 [2, 10] *
---
seqlengths:
chr1
NA
There is nothing that prevents the user from doing this kind of silly
renaming except that, in the case of the TranscriptDb object, this is
almost certainly not what the user intended to do. What s/he really
wanted was selecting chr1 instead of chrM, which can be achieved with:
> restoreSeqlevels(txdb)
> seqlevels(txdb, force=TRUE) <- "chr1"
> seqlengths(txdb)
chr1
249250621
Yes having to restore all chromosomes to the active state before one
can make a new selection is cumbersome so we probably need to revisit
this.
It would also help to point out somewhere (I may have missed it) that
TranscriptDb seqlevels (or rather the internal GRanges) are actually
pass
by reference:
txdb2 <- txdb
seqlevels(txdb)
seqlevels(txdb2)
restoreSeqlevels(txdb)
seqlevels(txdb2)
That's because TranscriptDb is a reference class. IMHO it shouldn't.
Also Herve: what is the definition of a "used" seqlevel? Does that mean
that a factor level is defined, but no range in the GRanges object is
assigned this level?
Yes. Like here:
> gr <- GRanges("chrM", IRanges(1:2, 10), seqlengths=c(chr1=5000,
chrM=25))
> gr
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chrM [1, 10] *
[2] chrM [2, 10] *
---
seqlengths:
chr1 chrM
5000 25
Only chrM is in use:
> seqlevelsInUse(gr)
[1] "chrM"
The idiom to drop seqlevels that are not in use is:
> seqlevels(gr) <- seqlevelsInUse(gr)
> gr
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chrM [1, 10] *
[2] chrM [2, 10] *
---
seqlengths:
chrM
25
Of course, in that case, force=TRUE is not needed because the operation
is guaranteed to not "shrink" the GRanges object.
Why would those exists in a TranscriptDb object at
all?
The notion of "seqlevels in use" is not formally defined for a
TranscriptDb object:
> seqlevelsInUse(txdb)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ŒseqlevelsInUse¹
for signature Œ"TranscriptDb"¹
but it could be: a seqlevel or chromosome could be considered to be
in use if there is at least 1 feature (transcript, exon or CDS) in
the db that is on it. I think for most TranscriptDb objects, all the
seqlevels are actually in use. However it's conceivable that a
TranscriptDb object could have some extra seqlevels that are not in
use (the seqlevels + their lengths + their circularity flags are stored
in a separate table, the 'chrominfo' table).
Anyway, when I added the seqlevelsInUse() generic a few months ago,
I didn't write a method for TranscriptDb objects because I couldn't
think of a use case for it. And also because without any change to
the current db schema, this would be a costly operation as it would
need to scan the entire database, which cannot be done with a single
query.
Now with the seqlevels() setter allowing to reduce the seqlevels
of a TranscriptDb object, we face the following choices: (a)
implement the seqlevelsInUse,TranscriptDb method and make the
'force' arg of the seqlevels() setter behave consistently with
that, or (b) not implement the seqlevelsInUse,TranscriptDb method
and make the 'force' arg meaningless. As I said earlier, my preference
goes for (b).
H.
Florian
On 9/18/13 9:08 PM, "Hervé Pagès" <hpa...@fhcrc.org> wrote:
Note that currently it's kind of inconsistent anyway because it doesn't
look at the seqlevels that are in use. For example:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
Trying to drop chrUn_gl000249 (which I know is not in use):
> seqlevels(txdb) <- setdiff(seqlevels(txdb), "chrUn_gl000249")
Error in .seqinfo.TranscriptDbReplace(x, new2old = new2old, force =
force, :
You need to use force=TRUE if you want to drop seqlevels.
'force=TRUE' should only be required when trying to drop seqlevels
that are in use.
So the choice are more between: (a) consistent, (b) convenient,
or (c) leave it as it is (which is neither convenient or consistent).
I vote for (b).
H.
On 09/18/2013 11:31 AM, Marc Carlson wrote:
I actually considered this, but I opted to do it this way just for the
sake of being consistent (which was my whole mission for implementing
seqlevels in here in the 1st place). Now I could make it more
convenient here and break consistency with how it is used elsewhere,
but
what do people prefer?
Consistent or convenient?
Marc
On 09/18/2013 10:40 AM, Hervé Pagès wrote:
Hi Marc,
Wouldn't it make sense to just ignore the 'force' arg when
dropping the seqlevels of a TranscriptDb?
The 'force' argument is FALSE by default and this prevents
seqlevels<- to shrink GRanges or other vector-like objects
when the user tries to drop seqlevels that are in use.
Internally seqlevels<- calls seqlevelsInUse() to get the
seqlevels currently in use and see if they intersect with
the seqlevels to drop.
In the TranscriptDb situation, people always have to use
'force=TRUE' to drop seqlevels, regardless of whether the
levels to drop are in use or not (the seqlevelsInUse()
getter not being defined for TranscriptDb objects, I suspect
seqlevels<- doesn't look at this).
So maybe 'force' could just be ignored for TranscriptDb objects?
That would make seqlevels<- a little bit more user-friendly on
those objects.
Thanks,
H.
On 09/13/2013 10:38 AM, Marc Carlson wrote:
Hi Florian,
Yes we are trying to make things more uniform. seqlevels() lets you
rename as well as deactivate chromosomes you want to ignore, so it
was
really redundant with isActiveSeq(). So we are moving away from
isActiveSeq() just so that users have less to learn about. The
reason
why isActiveSeq was different from seqlevels was just because it was
born for a TranscriptDb (which is based on an annotation database)
instead of being born on a GRanges object. So seqlevels was the
more
general tool.
Marc
On 09/13/2013 07:24 AM, Hahne, Florian wrote:
Hi Marc,
I saw these warnings in Gviz, but they stem from GenomicFeatures
Warning messages:
1: 'isActiveSeq' is deprecated.
Use 'seqlevels' instead.
See help("Deprecated") and help("GenomicFeatures-deprecated").
2: 'isActiveSeq' is deprecated.
Use 'seqlevels' instead.
See help("Deprecated") and help("GenomicFeatures-deprecated").
3: 'isActiveSeq<-' is deprecated.
Use 'seqlevels' instead.
See help("Deprecated") and help("GenomicFeatures-deprecated").
4: 'isActiveSeq<-' is deprecated.
Use 'seqlevels' instead.
See help("Deprecated") and help("GenomicFeatures-deprecated").
5: 'isActiveSeq' is deprecated.
Use 'seqlevels' instead.
See help("Deprecated") and help("GenomicFeatures-deprecated").
6: 'isActiveSeq<-' is deprecated.
Use 'seqlevels' instead.
See help("Deprecated") and help("GenomicFeatures-deprecated").
So has the whole idea of active chromosomes in the data base been
dropped? I could not find anything in the change notes. Do I get it
right that you can now do
seqlevels(txdb, force=TRUE) <- "chr1"
if you just want the first chromosome to be active?
Florian
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319