On 05/21/2013 08:00 AM, Michael Lawrence wrote:
On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>> wrote: Michael, On 05/20/2013 08:44 PM, Michael Lawrence wrote: On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org <mailto:hpa...@fhcrc.org> <mailto:hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>>> wrote: Hi, On 05/20/2013 03:15 PM, Michael Lawrence wrote: On Mon, May 20, 2013 at 3:11 PM, Martin Morgan <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org> <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>>> wrote: Hi Michael -- On 5/20/2013 5:56 AM, Michael Lawrence wrote: While it's cool that seqlevels<- has become so flexible, I still claim that rename/keep/drop would be a lot easier to read, because they describe the high-level operation, and there's no reason to deprecate them. They're also easier to remember. For example, for dropping with seqlevels<-, the user needs to remember that "force" is necessary to drop. Much easier to just say "dropSeqlevels(), please". And reimplementing keepSeqlevels is still too complicated. Is it such a maintenance burden to have these simple wrappers sit on top of the low-level manipulators? Another suggestion: renameSeqlevels should not require a named vector (in cases where it is unnamed, require that it is parallel to the existing seqlevels, as with seqlevels<-). I didn't really indicate what drove my desire to see keepSeqlevels deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq were developed more or less independently, and have different contracts. The different contracts are confusing to the user, as is creating a usable help system when there are multiple end points for what is a common operation. The help pages of each were inadequate in different ways. And because the code was developed independently, support for different objects was inconsistent. So actually, yes, the maintenance (and use) burden for the previous state of affairs was substantial. On the other hand, I agree that keepSeqlevels is convenient as a simple wrapper around seqlevels<-, in the same way that setNames and names<- are both useful. So we could iterate to keepSeqlevels <- function(x, value, ...) { seqlevels(x, force=TRUE) <- value x } but I'd be less enthusiastic about maintaining the original contract of keepSeqlevels, where keepSeqlevels(gr1, gr2), would have worked for two GRanges objects. Why would this be called keepSeqlevels() if, depending on how it's used, it will either add, drop, rename, or permute the seqlevels? Couldn't this be called setSeqlevels? I thought that new2old was necessary for permuting. The seqlevels setter has no 'new2old' arg. You're right that adding should be disallowed for keepSeqlevels(). Adding seqlevels is not a common operation. The two common operations are: * Permuting, usually because the data were imported in non-canonical order (seqnameStyle was designed to address this, no?). Permuting is straightforward with seqlevels<-: > seqlevels(gr) [1] "chr1" "chr2" "chr3" > seqlevels(gr) <- rev(seqlevels(gr)) > seqlevels(gr) [1] "chr3" "chr2" "chr1" * Subsetting, either by keeping or dropping. Also straightforward: > seqlevels(gr, force=TRUE) <- "chr2" > seqlevels(gr) [1] "chr2" Two main reasons: taking a small slice of the data for prototyping, or removing problematic chromosomes (sex, circular). This goes back to bsapply and the exclude argument. There are other important use cases: - Dropping seqlevels that are *not* in use. This happens for example when subsetting a BAM file with Rsamtools::filterBam to keep only the alignments located on a given chromosome. The entire sequence dictionary (in the header) is propagated to the resulting BAM file so when you read the file with readGAlignments() you end up with a bunch of useless seqlevels that you generally start by dropping. You don't want to drop any alignment so force=FALSE is your friend. This is sort of tangential to the discussion, but do you really want to do this? I would preserve the universe as given by the BAM. - An even more common operation is renaming: 90% of the times that's what I use seqlevels<- for, and that's what I tell people to use when they need to rename their chromosomes. Also straightforward: > seqlevels(gr) [1] "chr1" "chr2" "chr3" "chrM" > seqlevels(gr) <- sub("^chr", "", seqlevels(gr)) > seqlevels(gr) [1] "1" "2" "3" "M" > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT" > seqlevels(gr) [1] "1" "2" "3" "MT" Note that this is simpler to use than renameSeqlevels() which always required you to build the entire right value as a named vector. Sure, renaming is a common use case. Not sure how I forgot that. As I wrote earlier in the thread, renameSeqlevels should be changed so as not to require naming of the vector. So the added-value of keepSeqlevels() seems to boil down to: (a) it always uses force=TRUE (b) it preserves the original object and returns the modified object If you want to restrict the use of keepSeqlevels() to permuting and dropping, you'll also have to disallow renaming (in addition to adding). Then its name will more or less reflect what it does (if "keep" means "permute" or "drop"). The final result will be something that does less than setSeqlevels() and that is not necessarily easier to read: both will set on the object whatever vector of seqlevels they are supplied, except that one will fail when doing so would actually result in adding or renaming seqlevels. So it looks like seqlevels<- is pretty powerful. Now I see that if I scroll down to the examples in the man page, I find the actual documentation. It's cool to learn that we can replace seqlevels on TxDb objects.
Has been working on TranscriptDb objects for more than
My argument has always been that since these low-level operations are so powerful, it's nice to have high-level operations to clarify the code. We have to think hard to know what the RHS will do in that replacement.
When you do something like seqlevels(gr) <- new_seqlevels, you're saying "can you please replace the current seqlevels by those new ones". That's all. The request/contract is simple and all the complexity is taken care of internally by seqlevels<-. It might be that, depending on what's currently on 'gr', this will actually result in dropping, adding, permuting. Or *a combination of the 3*. Like here: > seqlevels(gr) [1] "chr1" "chr2" "chr3" > seqlevels(gr) <- c("chr4", "chr3", "chr1") > seqlevels(gr) [1] "chr4" "chr3" "chr1" # one seqlevel was drop, one was added, and the 2 seqlevels that were # preserved were reordered. When you want to subset a vector to keep all elements with an even index you do x[c(FALSE, TRUE)]. Is that keeping or dropping elements? Do we need to have a subsetting operator for keeping elements, another one for dropping elements, and another one for permuting them (like in x[sample(length(x))])? H.
It's probably one of the most complex replacement operations; far more complex than the typical one, including levels<- on factor. There's nothing wrong with that; it's just complexity that we should be able to hide. Michael H. Michael H. Well, I wasn't even aware of that feature, so you've made your point about the documentation ;) Sounds like a good compromise to me. Thanks for understanding, Michael Martin Michael On Sat, May 18, 2013 at 6:00 PM, Martin Morgan <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org> <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org> <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>>>> wrote: On 05/18/2013 05:42 PM, Martin Morgan wrote: Some of the most common operations are straight-forward, e.g., > gr = GRanges(paste0("chr", c(1:22, "X", "Y")), IRanges(1, 100)) > seqlevels(gr) = sub("chr", "ch", seqlevels(gr)) To get a more comprehensive example I should have followed my own advice and grabbed from the help page! ## Rename: seqlevels(txdb) <- sub("chr", "", seqlevels(txdb)) seqlevels(txdb) seqlevels(txdb) <- paste0("CH", seqlevels(txdb)) seqlevels(txdb) seqlevels(txdb)[seqlevels(__**____txdb) == "CHM"] <- "M" seqlevels(txdb) -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793> -- Dr. Martin Morgan, PhD Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 [[alternative HTML version deleted]] ___________________________________________________ Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> <mailto:Bioc-devel@r-project.__org <mailto:Bioc-devel@r-project.org>> mailing list https://stat.ethz.ch/mailman/____listinfo/bioc-devel <https://stat.ethz.ch/mailman/__listinfo/bioc-devel> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel <https://stat.ethz.ch/mailman/listinfo/bioc-devel>> -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org> <mailto:hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
-- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel