Re: [Bioc-devel] Removal of Information in OrgDb generated from NCBI -- Feedback needed.
Hi Lori,

I agree with Michael. I for myself do not need the UNIGENE field. The most frequent columns that I use are: ENTREZID, REFSEQ, ENSEMBL, SYMBOL, GENENAME, UNIPROT, gene_id and GO.

Best,
Tobias

On 30.04.20 at 08:46, Stadler, Michael wrote:

Hi Lori

Just my two cents: I would not miss UNIGENE. I am using org.db's mostly to annotate primary gene identifiers (ENTREZID, ENSEMBL) with additional human-readable information (SYMBOL, GENENAME), and to map between different primary identifiers. I mostly use, in decreasing order of importance:

org.Hs.eg.db
org.Mm.eg.db
org.Rn.eg.db
org.Ce.eg.db
org.Dm.eg.db
org.Sc.sgd.db

And the most frequent columns that I use are (again, in decreasing order of importance): ENTREZID, ENSEMBL, SYMBOL, GENENAME, UNIPROT, GO.

Maybe there is some indirect other usage that I am not aware of.

Best wishes,
Michael

Original message:
From: Bioc-devel, on behalf of Shepherd, Lori
Sent: Wednesday, 29 April 2020 19:50
To: Bioc-devel@r-project.org
Subject: [Bioc-devel] Removal of Information in OrgDb generated from NCBI -- Feedback needed.

Hello Bioconductor maintainers,

The core team was made aware of an issue with one of the make-orgDb functions in AnnotationForge: https://github.com/Bioconductor/AnnotationForge/issues/13

Investigating further, NCBI will no longer be updating the gene2unigene file. The url has moved to an ARCHIVE directory, and an explanation and notice of retirement can be found here: ftp://ftp.ncbi.nih.gov/repository/UniGene/README

We use this function when creating OrgDb's for the AnnotationHub, and it is also the recommended way for users to make custom OrgDb's from NCBI. Temporarily we are updating the url to the new location, but we are thinking of removing the gene2unigene data from the orgDbs.

We would like to ask the community, especially those that utilize the orgDb objects frequently: is this data still necessary, and would the removal of UNIGENE cause a large disruption to current packages/functions/utilization of objects? Any feedback is greatly appreciated.

Thank you,

Lori Shepherd
Bioconductor Core Team
Roswell Park Comprehensive Cancer Center
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
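The column names discussed in this thread map directly onto AnnotationDbi's columns()/select() interface; a minimal sketch, assuming the AnnotationDbi and org.Hs.eg.db packages are installed (this is illustrative, not part of the original discussion):

```r
## Sketch: querying an OrgDb for the commonly used columns mentioned above.
library(AnnotationDbi)
library(org.Hs.eg.db)

## List every column the OrgDb offers -- this is where UNIGENE would
## (or would no longer) appear.
columns(org.Hs.eg.db)

## Map a few Entrez IDs to symbol, gene name and Ensembl ID.
select(org.Hs.eg.db,
       keys    = c("1", "10", "100"),
       keytype = "ENTREZID",
       columns = c("SYMBOL", "GENENAME", "ENSEMBL"))
```

If code like this is the only way a package touches the OrgDb, removing UNIGENE simply shrinks the result of columns().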
Re: [Bioc-devel] How to speed up GRange comparison
Hi Herve,

Thank you for your answer. pcompare works fine for me. Here is my solution:

query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)))
subject <- GRanges(rep("chr1", 4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)))
out <- vector("numeric", length(query))
out[which(abs(pcompare(query, subject)) < 5)] <- 1
out

Carey was right that this was off-list. Next time I will post my question on support.bioconductor.org.

Best,
Tobias

On 29.01.20 at 18:02, Pages, Herve wrote:
> Yes, poverlaps().
>
> Or pcompare(), which should be even faster. But only if you are not
> afraid to go low-level. See ?rangeComparisonCodeToLetter for the meaning
> of the codes returned by pcompare().
>
> H.
>
> On 1/29/20 08:01, Michael Lawrence via Bioc-devel wrote:
>> poverlaps()?
>>
>> On Wed, Jan 29, 2020 at 7:50 AM web working wrote:
>>> Hello,
>>>
>>> I have two big GRanges objects and want to search for an overlap of the
>>> first range of query with the first range of subject, then take the
>>> second range of query and compare it with the second range of subject,
>>> and so on. Here is an example of my problem:
>>>
>>> # GRanges objects
>>> query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)), id = 1:4)
>>> subject <- GRanges(rep("chr1", 4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)), id = 1:4)
>>>
>>> # The 2 overlaps at the first position should not be counted, because
>>> # these ranges are in different rows.
>>> countOverlaps(query, subject)
>>>
>>> # Approach 1 (bad style; I have simplified it for clarity)
>>> dat <- as.data.frame(findOverlaps(query, subject))
>>> indexDat <- apply(dat, 1, function(x) x[1] == x[2])
>>> indexBool <- dat[indexDat, 1]
>>> out <- rep(FALSE, length(query))
>>> out[indexBool] <- TRUE
>>> as.numeric(out)
>>>
>>> # Approach 2 (bad style and takes too long)
>>> out <- vector("numeric", 4)
>>> for (i in seq_along(query)) out[i] <- overlapsAny(query[i], subject[i])
>>> out
>>>
>>> # Approach 3 (wrong results)
>>> as.numeric(overlapsAny(query, subject))
>>> as.numeric(overlapsAny(split(query, 1:4), split(subject, 1:4)))
>>>
>>> Maybe someone has an idea to speed this up?
>>>
>>> Best,
>>> Tobias
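As a footnote to the thread above, Michael's plain poverlaps() suggestion already gives the row-wise 0/1 vector directly, without decoding pcompare() codes; a sketch, assuming GenomicRanges is attached (not code from the original thread):

```r
library(GenomicRanges)

query   <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)))
subject <- GRanges(rep("chr1", 4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)))

## poverlaps() is the "parallel" variant of overlap testing: element i of
## query is compared only with element i of subject, which is exactly the
## row-by-row comparison asked for. Coerce the result to get 0/1.
as.integer(poverlaps(query, subject))
```

Both poverlaps() and pcompare() are vectorized in C, so either avoids the slow element-by-element loop of Approach 2.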
[Bioc-devel] How to speed up GRange comparison
Hello,

I have two big GRanges objects and want to search for an overlap of the first range of query with the first range of subject, then take the second range of query and compare it with the second range of subject, and so on. Here is an example of my problem:

# GRanges objects
query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)), id = 1:4)
subject <- GRanges(rep("chr1", 4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)), id = 1:4)

# The 2 overlaps at the first position should not be counted, because
# these ranges are in different rows.
countOverlaps(query, subject)

# Approach 1 (bad style; I have simplified it for clarity)
dat <- as.data.frame(findOverlaps(query, subject))
indexDat <- apply(dat, 1, function(x) x[1] == x[2])
indexBool <- dat[indexDat, 1]
out <- rep(FALSE, length(query))
out[indexBool] <- TRUE
as.numeric(out)

# Approach 2 (bad style and takes too long)
out <- vector("numeric", 4)
for (i in seq_along(query)) out[i] <- overlapsAny(query[i], subject[i])
out

# Approach 3 (wrong results)
as.numeric(overlapsAny(query, subject))
as.numeric(overlapsAny(split(query, 1:4), split(subject, 1:4)))

Maybe someone has an idea to speed this up?

Best,
Tobias
Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages
Hi Herve,

thank you for your answer. To be honest, I am fine with it if the data sets cannot be loaded with data().

Earlier I mistakenly said that a warning occurred during BiocCheck; the warning actually occurred during the R CMD check of my data package. I used the recommended R CMD check environment flags for the check of my package (https://github.com/Bioconductor/Contributions/blob/master/CONTRIBUTING.md#r-cmd-check-environment):

devtools::check(document = FALSE,
                args = c('--no-build-vignettes'),
                build_args = c('--resave-data', '--no-build-vignettes'))

During the "Building" step of the R CMD check some "strange" behavior occurs. A lot of characters are printed to the screen:

...
- checking for empty or unneeded directories
- looking to see if a 'data/datalist' file should be added
dispersionFunction fitType varLogDispEsts dispPriorVar dispFunction fit d means disps minDisp means class class class class rowRanges unlistData elementMetadata elementType metadata partitioning class colData rownames nrows listData [1] [2] elementType
...

At the "Checking" step I got a warning with the same effect:

...
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... WARNING
[3] [4] [5] [6] names [2] [3] [4] [5] names dispersionFunction fitType varLogDispEsts dispPriorVar [1] [1] [1] [2] names class row.names [2] means class
...

This warning only occurs if I store my RData files in the data directory. Here is an example of the class object I store in the RData file:

#' @rdname dummyDataSet
setClass("dummyDataSet",
         slots = c(dds = "list",
                   genes = "GenomicRanges",
                   bamFiles = "list",
                   resultTables = "list",
                   treatment = "character",
                   nameAnalysis = "character",
                   numberOfCores = "numeric"),
         validity = function(object) { ... },
         contains = "dummySoftware")

My class consists of lists of data.frames (e.g. resultTables), lists of S4 objects (e.g. a list of DESeq2 objects (dds)), S4 objects (e.g. genes) and more. Maybe this is the reason for the "strange" behavior? If you need more information, just let me know.

Here is my sessionInfo:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /home/x/Programme/R_Versions/R-3.6.1/lib/libRblas.so
LAPACK: /home/x/Programme/R_Versions/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8
 [4] LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dummyData_0.1.0

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 yaml_2.2.0

Best,
Tobias

On 09.01.20 at 22:40, Pages, Herve wrote:
> On 1/9/20 13:00, web working wrote:
>> Hi Herve,
>> thank you for your detailed answer. I guess I have expressed myself
>> unclearly. The BED files were just examples of data I store in the
>> inst/extdata folder. Based on the description for ExperimentHubData I
>> have decided to create a software and a data package (no
>> ExperimentHubData software package). In my RData files I store software
>> package objects. These objects are bigger than 5 MB. Using a helper
>> function is not an option, because the object calculation takes too much
>> time. For this reason I want to load these objects for my example
>> functions. My question is whether the storage of my RData files in the
>> inst/extdata directory is correct or not.
>
> It's technically correct, but it's not as convenient as putting them in
> data/ because they can no longer be listed and/or loaded with data().
> So if you're storing them in inst/extdata only because the data()
> solution gave you a BiocCheck warning then I'd say that you're giving up
> too easily ;-) IMO it is important to try to understand why the data()
> solution gave you a BiocCheck warning in the first place. Unfortunately
> you're not providing enough information for us to be able to tell. What
> does the warning say? How can we reproduce the warning? Ideally we would
> need to see a transcript of your session and links to your packages.
>
> Thanks,
> H.
Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages
Thank you for your example, Kasper. The require option seems to be an option for me. I am following the Bioconductor "Circular Dependencies" guidelines (https://github.com/Bioconductor/Contributions/blob/master/CONTRIBUTING.md#submitting-related-packages) to implement my software and my data package, using the "Suggests" and "Depends" connection.

On 15.01.20 at 00:12, Kasper Daniel Hansen wrote:

Tobias,

When you use the data() command on the data package, you need to do library(dummyData) first (and you therefore need to Suggest: dummyData).

Here is an example from minfi/minfiData:

if (require(minfiData)) {
    dat <- preprocessIllumina(RGsetEx, bg.correct = FALSE, normalize = "controls")
}

Note how I use require to load the package. For clarity you could argue I should also have data(RGsetEx), but it is technically not necessary because of lazy loading.

On Thu, Jan 9, 2020 at 4:40 PM Pages, Herve wrote:

On 1/9/20 13:00, web working wrote:

Hi Herve,
thank you for your detailed answer. I guess I have expressed myself unclearly. The BED files were just examples of data I store in the inst/extdata folder. Based on the description for ExperimentHubData I have decided to create a software and a data package (no ExperimentHubData software package). In my RData files I store software package objects. These objects are bigger than 5 MB. Using a helper function is not an option, because the object calculation takes too much time. For this reason I want to load these objects for my example functions. My question is whether the storage of my RData files in the inst/extdata directory is correct or not.

It's technically correct, but it's not as convenient as putting them in data/ because they can no longer be listed and/or loaded with data(). So if you're storing them in inst/extdata only because the data() solution gave you a BiocCheck warning then I'd say that you're giving up too easily ;-) IMO it is important to try to understand why the data() solution gave you a BiocCheck warning in the first place. Unfortunately you're not providing enough information for us to be able to tell. What does the warning say? How can we reproduce the warning? Ideally we would need to see a transcript of your session and links to your packages.

Thanks,
H.

Best,
Tobias

On 09.01.20 at 17:59, Pages, Herve wrote:

Hi Tobias,

If the original data is in BED files, there should be no need to serialize the objects obtained by importing the files. It is **much** better to provide a small helper function that creates an object from a BED file and to use that function each time you need to load an object. This has at least 2 advantages:

1. It avoids redundant storage of the data.
2. By avoiding serialization of high-level S4 objects, it makes the package easier to maintain in the long run.

Note that the helper function could also implement a cache mechanism (easy to do with an environment) so the BED file is only loaded and the object created the 1st time the function is called. On subsequent calls, the object is retrieved from the cache.

However, if the BED files are really big (e.g. > 50 Mb), we require them to be stored on ExperimentHub instead of inside dummyData. Note that you still need to provide the dummyData package (which becomes what we call an ExperimentHub-based data package). See the "Creating An ExperimentHub Package" vignette in the ExperimentHubData package for more information about this.

Hope this helps,
H.

On 1/9/20 04:45, web working wrote:

Dear all,

I am currently developing a software package (dummySoftware) and a data package (dummyData), and I am a bit confused about where to store my RData files in the data package. Here is my situation:

I want to store some software package objects (new class objects of the software package) in the data package. These objects are example objects and are too big for software packages. As I have read here (http://r-pkgs.had.co.nz/data.html), all RData objects should be stored in the data directory of a package. BED files of the data package are stored in inst/extdata.

The data of the data package will be addressed in the software package like this: system.file('extdata', 'subset.bed', package = 'dummyData'). And here the problem occurs. After building the data package (devtools::build(args = c('--resave-data'))), all data in data/ are stored in datalist, Rdata.rdb, Rdata.rds and Rdata.rdx files and can no longer be addressed with system.file. Addressing this data with the data() function results in a warning during BiocCheck::BiocCheck().

My solution is to store the RData files in the inst/extdata directory and address them with system.file. Something similar is mentioned here, but in the context of a vignette (r-pkgs.had.co.nz/data.html#other-data). Is this the way to do it?

Best,
Tobias
Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages
Hi Richard,

It depends on the file type. I am loading my "non-RData" files with read.delim and my RData files with a helper function which returns the R object stored in the RData file:

#' Load RData object and return first entry
#'
#' Load an RData object and return the first entry. If there is more than one
#' object in the RData file, only the first object will be returned.
#'
#' @param RDataFile a \code{character} vector with the path to the RData file.
#'
#' @return The \code{R} object of the RData file.
#' @export
#' @examples
#' # load a GRanges object stored in an RData file.
#' dummy.GRanges <- loadRData(system.file('extdata', 'dummy.RData', package = "dummyData"))
loadRData <- function(RDataFile) {
    load(RDataFile)
    objectToLoad <- ls()[ls() != "RDataFile"]
    if (length(objectToLoad) > 1)
        warning(paste0("RData file contains more than one object. Only the first object (",
                       objectToLoad[1], ") will be returned!"))
    get(objectToLoad[1])
}

I know this is not the best solution. I guess saving the R objects in RDS files instead of RData files is the better solution here. My question is whether my storage of the RData objects (or RDS objects) in the inst/extdata directory is okay for a Bioconductor package.

Best,
Tobias

On 09.01.20 at 15:45, Richard Virgen-Slane wrote:
> I may be missing a point, but how are you loading the saved files?
>
> On Thu, Jan 9, 2020 at 4:46 AM web working <webwork...@posteo.de> wrote:
>
> Dear all,
>
> I am currently developing a software package (dummySoftware) and a
> data package (dummyData), and I am a bit confused about where to store
> my RData files in the data package. Here is my situation:
>
> I want to store some software package objects (new class objects
> of the software package) in the data package. These objects are
> example objects and are too big for software packages. As I have read
> here (http://r-pkgs.had.co.nz/data.html), all RData objects should
> be stored in the data directory of a package. BED files of the
> data package are stored in inst/extdata.
>
> The data of the data package will be addressed in the software
> package like this: system.file('extdata', 'subset.bed', package =
> 'dummyData'). And here the problem occurs. After building the data
> package (devtools::build(args = c('--resave-data'))), all data in
> data/ are stored in datalist, Rdata.rdb, Rdata.rds and Rdata.rdx
> files and can no longer be addressed with system.file. Addressing this
> data with the data() function results in a warning during
> BiocCheck::BiocCheck().
>
> My solution is to store the RData files in the inst/extdata
> directory and address them with system.file. Something similar is
> mentioned here, but in the context of a vignette
> (r-pkgs.had.co.nz/data.html#other-data). Is this the way to do it?
>
> Best,
> Tobias
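Since an RDS file stores exactly one object and readRDS() returns it directly, RDS files avoid the ls()/get() guessing in loadRData() entirely. A base-R sketch with a stand-in object and a temporary path (both hypothetical, for illustration only):

```r
## Stand-in for a real annotation object; any R object works.
dummy_obj <- data.frame(start = c(1, 5), end = c(2, 6))

## Save one object per .rds file (in a package these files would
## typically live in inst/extdata).
rds_file <- tempfile(fileext = ".rds")  # hypothetical path
saveRDS(dummy_obj, rds_file)

## readRDS() hands the object back directly -- no need to know the
## name it was saved under, unlike load().
restored <- readRDS(rds_file)
identical(restored, dummy_obj)
```

Inside a package, the load call would then look like readRDS(system.file('extdata', 'dummy.rds', package = "dummyData")).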
Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages
Hi Herve,

thank you for your detailed answer. I guess I have expressed myself unclearly. The BED files were just examples of data I store in the inst/extdata folder. Based on the description for ExperimentHubData I have decided to create a software and a data package (no ExperimentHubData software package). In my RData files I store software package objects. These objects are bigger than 5 MB. Using a helper function is not an option, because the object calculation takes too much time. For this reason I want to load these objects for my example functions. My question is whether the storage of my RData files in the inst/extdata directory is correct or not.

Best,
Tobias

On 09.01.20 at 17:59, Pages, Herve wrote:

Hi Tobias,

If the original data is in BED files, there should be no need to serialize the objects obtained by importing the files. It is **much** better to provide a small helper function that creates an object from a BED file and to use that function each time you need to load an object. This has at least 2 advantages:

1. It avoids redundant storage of the data.
2. By avoiding serialization of high-level S4 objects, it makes the package easier to maintain in the long run.

Note that the helper function could also implement a cache mechanism (easy to do with an environment) so the BED file is only loaded and the object created the 1st time the function is called. On subsequent calls, the object is retrieved from the cache.

However, if the BED files are really big (e.g. > 50 Mb), we require them to be stored on ExperimentHub instead of inside dummyData. Note that you still need to provide the dummyData package (which becomes what we call an ExperimentHub-based data package). See the "Creating An ExperimentHub Package" vignette in the ExperimentHubData package for more information about this.

Hope this helps,
H.

On 1/9/20 04:45, web working wrote:

Dear all,

I am currently developing a software package (dummySoftware) and a data package (dummyData), and I am a bit confused about where to store my RData files in the data package. Here is my situation:

I want to store some software package objects (new class objects of the software package) in the data package. These objects are example objects and are too big for software packages. As I have read here (http://r-pkgs.had.co.nz/data.html), all RData objects should be stored in the data directory of a package. BED files of the data package are stored in inst/extdata.

The data of the data package will be addressed in the software package like this: system.file('extdata', 'subset.bed', package = 'dummyData'). And here the problem occurs. After building the data package (devtools::build(args = c('--resave-data'))), all data in data/ are stored in datalist, Rdata.rdb, Rdata.rds and Rdata.rdx files and can no longer be addressed with system.file. Addressing this data with the data() function results in a warning during BiocCheck::BiocCheck().

My solution is to store the RData files in the inst/extdata directory and address them with system.file. Something similar is mentioned here, but in the context of a vignette (r-pkgs.had.co.nz/data.html#other-data). Is this the way to do it?

Best,
Tobias
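The environment-based cache mechanism Herve describes can be sketched in a few lines of base R. The `loader` argument below is a stand-in for whatever function turns a file into an object (e.g. rtracklayer::import for BED files); it is a parameter here only so the sketch stays base-R:

```r
## A memoising loader built on an environment, as suggested above.
## The cache environment would normally be private to the package
## namespace.
.cache <- new.env(parent = emptyenv())

load_cached <- function(file, loader) {
  if (!exists(file, envir = .cache)) {
    ## first call for this file: load the object and store it
    assign(file, loader(file), envir = .cache)
  }
  ## every later call: return the cached object
  get(file, envir = .cache)
}
```

On the first call, load_cached("subset.bed", rtracklayer::import) would read the file; all subsequent calls with the same path return the cached object without touching the disk.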
[Bioc-devel] How to use RData files in Bioconductor data and software packages
Dear all,

I am currently developing a software package (dummySoftware) and a data package (dummyData), and I am a bit confused about where to store my RData files in the data package. Here is my situation:

I want to store some software package objects (new class objects of the software package) in the data package. These objects are example objects and are too big for software packages. As I have read here (http://r-pkgs.had.co.nz/data.html), all RData objects should be stored in the data directory of a package. BED files of the data package are stored in inst/extdata.

The data of the data package will be addressed in the software package like this: system.file('extdata', 'subset.bed', package = 'dummyData'). And here the problem occurs. After building the data package (devtools::build(args = c('--resave-data'))), all data in data/ are stored in datalist, Rdata.rdb, Rdata.rds and Rdata.rdx files and can no longer be addressed with system.file. Addressing this data with the data() function results in a warning during BiocCheck::BiocCheck().

My solution is to store the RData files in the inst/extdata directory and address them with system.file. Something similar is mentioned here, but in the context of a vignette (r-pkgs.had.co.nz/data.html#other-data). Is this the way to do it?

Best,
Tobias
[Bioc-devel] Roxygen
Hi together,

I am struggling a bit with the roxygen function import. Do you know how to add the dplyr %>% functionality to a function without importing the whole dplyr package?

This is my current solution:

#' Return first 10 entries
#'
#' @param mtcars a \code{data.frame} object
#'
#' @import dplyr
#' @return a \code{data.frame} with the first 10 entries
#' @export
#'
#' @examples
#' fancyFunction(mtcars)
fancyFunction <- function(mtcars){
    output <- mtcars %>% head(10)
    return(output)
}

This is how I would like to have it:

#' Return first 10 entries
#'
#' @param mtcars a \code{data.frame} object
#'
#' @importFrom dplyr %>%
#' @return a \code{data.frame} with the first 10 entries
#' @export
#'
#' @examples
#' fancyFunction(mtcars)
fancyFunction <- function(mtcars){
    output <- mtcars %>% head(10)
    return(output)
}

Do you have any idea how to deal with this problem?

Best
Tobias
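As a note on the question above: the @importFrom form in the second snippet is the standard answer, and since %>% originates in the magrittr package (dplyr only re-exports it), importing it from magrittr keeps the dependency minimal. A sketch, assuming magrittr is listed under Imports in DESCRIPTION (the argument is renamed to `dat` here to avoid shadowing the mtcars dataset):

```r
#' Return first 10 entries
#'
#' @param dat a \code{data.frame} object
#'
#' @importFrom magrittr %>%
#' @return a \code{data.frame} with the first 10 entries
#' @export
#'
#' @examples
#' fancyFunction(mtcars)
fancyFunction <- function(dat) {
    dat %>% head(10)
}
```

roxygen2 then writes importFrom(magrittr, "%>%") into the NAMESPACE, so only the pipe operator is imported, not the whole package.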
Re: [Bioc-devel] problem with documentation of setMethod with different signatures
Hi Martin,

thank you very much for this fast and detailed answer. Your solution works great for me. I agree with your point that roxygen requires a lot of manual effort to arrive at a satisfactory solution, especially when you use it for the first time.

Tobias

On 14.02.19 at 12:07, Martin Morgan wrote:

It's important to distinguish between problems in your code and problems in the tools (or use of tools) that process the code. Here the use of setGeneric() / setMethod() is correct. Your use of roxygen2 is getting in the way.

You've used @rdname to document the methods on the same page (reducing the number of man pages probably helps the user, so this is good). But then on the same page you've documented `@param x`, for instance, twice, once for GRanges and once for integer. Here the solution is to document `@param x` in only one place, e.g., in the generic, with sentences that describe appropriate input for each method:

#' @param x A `GRanges` instance or `integer` vector that...

For other tags, e.g., `@details`, `@return`, multiple uses are concatenated into paragraphs in a single element in the Rd file; one might then say

#' @details methodA,GRanges-method does one thing. ...
#' @details methodA,integer-method does another.

Generally, it is much harder to provide clear, user-oriented documentation for S4 classes and methods, and roxygen requires a lot of manual effort to arrive at a satisfactory solution. You've already started down that road by using the @rdname tag to group methods documentation on the same page. While open for discussion and certainly dependent on context, I think it is more helpful to group documentation by class rather than generic, e.g., 'I have a GRanges object, what can I do with it?' rather than 'I wonder what I can find the `start()` of?'.

Also, the best documentation pages are really very carefully constructed, and these are very difficult to generate automatically from roxygen snippets associated with individual functions / methods; an approach might be to provide a single roxygen chunk at the top of a source file that documents the content of the source file, with individual methods etc. restricted to the tags `@rdname` and `@export`. The old-school approach is to simply edit the man page by hand directly.

Finally, it has proven very helpful to organize code, man pages, and unit tests in a parallel fashion:

R/My-class.R
man/My-class.Rd
tests/testthat/test_My-class.R

The GenomicRanges package (https://github.com/Bioconductor/GenomicRanges) might be an advanced example of structure, though based on old-school manual construction of Rd files.

Martin

On 2/14/19, 4:08 AM, "Bioc-devel on behalf of web working" wrote:

Hi,

I am struggling a bit with an R generic function. I built a generic and set two implementations of the generic with two different signatures as input. Both implementations of the generic produce the same output but have different input. During devtools::check() I get the following warning:

❯ checking Rd \usage sections ... WARNING
  Duplicated \argument entries in documentation object 'methodA': 'x' 'size'
  Functions with \usage entries need to have the appropriate \alias entries,
  and all their arguments documented. The \usage entries must correspond to
  syntactically valid R code. See chapter 'Writing R documentation files' in
  the 'Writing R Extensions' manual.

The original functions are complex, so here are some dummy methods:

#' methodA methods generic
#' @rdname methodA-methods
#' @export
setGeneric("methodA", function(x, size = 1000) standardGeneric("methodA"))

#' methodA method for \code{GRanges} input
#'
#' @param x a \code{GRanges} object
#' @param size a \code{numeric} vector
#'
#' @import GenomicRanges
#' @return a \code{list} object
#' @rdname methodA-methods
#' @export
#' @examples
#' library(GenomicRanges)
#' dat.GRanges <- GRanges(seqnames = c(rep("chr1", 5), rep("chr2", 5)),
#'                        IRanges(start = rep(c(1, 1, 55000, 55000, 15), 2),
#'                                end = rep(c(2, 2, 7, 7, 60), 2)))
#' out.list <- methodA(x = dat.GRanges, size = length(dat.GRanges))
setMethod(methodA, signature(x = "GRanges"), function(x, size = 1000){
    s <- start(x)
    return(list(s, size))
})

#' methodA method for named \code{integer} input
#'
#' @param x a \code{integer} vector
#' @param size a \code{numeric} vector
#'
#' @return a \code{list} object
#' @export
#' @rdname methodA-methods
#' @examples
#' dat <- 1:20
#' out.list <- methodA(x = dat, size = length(dat))
setMethod(methodA, signature(x = "integer"), function(x, size = 1000){
    return(list(x, size))
})
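Martin's "document each argument once, on the generic" advice can be sketched concretely with the dummy methodA from the thread; only the placement of the roxygen tags differs from the original code (the GRanges method, omitted here, would likewise carry only @rdname and @export):

```r
library(methods)

#' methodA: dummy generic from the thread
#'
#' Arguments are documented once, on the generic, so the methods
#' below add no duplicated \\usage or @param entries to the Rd page.
#'
#' @param x A `GRanges` instance or an `integer` vector.
#' @param size a `numeric` vector
#' @rdname methodA-methods
#' @export
setGeneric("methodA", function(x, size = 1000) standardGeneric("methodA"))

#' @rdname methodA-methods
#' @export
setMethod(methodA, signature(x = "integer"), function(x, size = 1000) {
    list(x, size)
})
```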
[Bioc-devel] problem with documentation of setMethod with different signatures
Hi, I am struggling a bit with a R generic function. I build a generic and set two implementations of the generic with two different signatures as input. Both implementations of the generic produce the same output but have a different input. During devtools::check() I get the following error: ❯ checking Rd \usage sections ... WARNING Duplicated \argument entries in documentation object 'methodA': ‘x’ ‘size’ Functions with \usage entries need to have the appropriate \alias entries, and all their arguments documented. The \usage entries must correspond to syntactically valid R code. See chapter ‘Writing R documentation files’ in the ‘Writing R Extensions’ manual. The original functions are complex so here are some dummy methods: #' methodA methods generic #' @rdname methodA-methods #' @export #' setGeneric("methodA", function(x, size = 1000) standardGeneric("methodA")) #' methodA method for \code{GRanges} input #' #' @param x a \code{GRanges} object #' @param size a \code{numeric} vector #' #' @import GenomicRanges #' @return a \code{list} object #' @rdname methodA-methods #' @export #' @examples #' library(GenomicRanges) #' dat.GRanges <- GRanges(seqnames=c(rep("chr1", 5), rep("chr2", 5)), #' IRanges(start = rep(c(1, 1, 55000, 55000, 15), 2), #' end = rep(c(2, 2, 7, 7, 60), 2))) #' out.list <- methodA(x = dat.GRanges, size = length(dat.GRanges)) #' setMethod(methodA, signature(x="GRanges"), function(x, size = 1000){ s <- start(x) return(list(s, size)) }) #' methodA method for named \code{integer} input #' #' @param x a \code{integer} vector #' @param size a \code{numeric} vector #' #' @return a \code{list} object #' @export #' @rdname methodA-methods #' @examples #' dat <- 1:20 #' out.list <- methodA(x = dat, size = length(dat)) setMethod(methodA, signature(x="integer"), function(x, size = 1000){ return(list(x, size)) }) The error above sounds absolute understandable for me, but I do not have a solution for this. Maybe using a generic is not the way to do this here? 
Tobias

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] ddply causes error during R check
Hi Martin,

thank you for this approach. I will check my code and see where I can use it.

Tobias

On 12.02.19 at 14:58, Martin Morgan wrote:

Use `globalVariables()` to declare these symbols and quieten the warnings, at the expense of quietening warnings about undefined variables in _all_ code and potentially silencing true positives.

Alternatively, avoid non-standard evaluation (this is what ddply is doing, using special rules to resolve symbols like `name`) by using base R functionality; note also that non-standard evaluation is prone to typos, e.g., looking up the typo `hpx` in the calling environment rather than in the data frame:

hpx = 1
ddply(mtcars, "cyl", "summarize", value = mean(hpx))  ## oops, meant `mean(hp)`
  cyl value
1   4     1
2   6     1
3   8     1

Marginally better is

aggregate(hp ~ cyl, mtcars, mean)
  cyl        hp
1   4  82.63636
2   6 122.28571
3   8 209.21429

where R recognizes symbols in the formula ~ as intentionally unresolved. The wizards on the list might point to constructs in the rlang package.

Martin

On 2/12/19, 2:35 AM, "Bioc-devel on behalf of web working" wrote:

Hi,

I am developing a Bioconductor package and cannot get rid of some warning messages. During devtools::check() I get the following warnings:

...
summarizeDataFrame: no visible binding for global variable ‘name’
summarizeDataFrame: no visible binding for global variable ‘gene’
summarizeDataFrame: no visible binding for global variable ‘value’
...

Here is a short version of the function:

#' Collapse rows with duplicated name column
#'
#' @param dat a \code{tibble} with the columns name, gene and value
#' @importFrom plyr ddply
#' @import tibble
#' @return a \code{tibble}
#' @export
#'
#' @examples
#' dat <- tibble(name = c(paste0("position", 1:5), paste0("position", 1:3)),
#'               gene = paste0("gene", 1:8), value = 1:8)
#' summarizeDataFrame(dat)
summarizeDataFrame <- function(dat) {
  ddply(dat, "name", "summarize",
        name  = unique(name),
        gene  = paste(unique(gene), collapse = ","),
        value = mean(value))
}

R interprets the "name", "gene" and "value" column names as variables during the check. Does anyone have an idea how to change the syntax of ddply or how to get rid of the warning messages?

Thanks in advance!

Tobias
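Following Martin's advice to drop non-standard evaluation altogether, the function quoted above can be rewritten in base R so that R CMD check sees no unbound symbols; a minimal sketch (the name summarizeDataFrameBase and the use of aggregate()/merge() are my assumptions, not code from the thread):

```r
# Base-R rewrite of summarizeDataFrame: one output row per name,
# genes collapsed to a comma-separated string, values averaged.
# No non-standard evaluation, so no "no visible binding" NOTEs.
summarizeDataFrameBase <- function(dat) {
  dat <- as.data.frame(dat)
  genes  <- aggregate(gene  ~ name, dat,
                      function(g) paste(unique(g), collapse = ","))
  values <- aggregate(value ~ name, dat, mean)
  merge(genes, values, by = "name")
}

dat <- data.frame(
  name  = c(paste0("position", 1:5), paste0("position", 1:3)),
  gene  = paste0("gene", 1:8),
  value = 1:8,
  stringsAsFactors = FALSE
)
summarizeDataFrameBase(dat)
```

Because the columns are addressed through formulas and `$`, the symbols `name`, `gene` and `value` never appear as free variables, which is exactly what the check complains about.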
Re: [Bioc-devel] ddply causes error during R check
Hi Mike,

thank you for pointing out that there are other packages in the same situation.

Tobias

On 12.02.19 at 14:47, Mike Smith wrote:
> If you're sure these are false positives (and it looks like they are)
> then you can use utils::globalVariables() outside of your function to
> get rid of the note. It might also be worth pointing out that there
> are plenty of Bioconductor packages that don't do this and simply
> have it mentioned in their check results, e.g.
> http://bioconductor.org/checkResults/devel/bioc-LATEST/beadarray/malbec2-checksrc.html
>
> Mike
>
> On Tue, 12 Feb 2019 at 08:35, web working <webwork...@posteo.de> wrote:
>
> Hi,
>
> I am developing a Bioconductor package and cannot get rid of some
> warning messages. During devtools::check() I get the following
> warnings:
>
> ...
> summarizeDataFrame: no visible binding for global variable ‘name’
> summarizeDataFrame: no visible binding for global variable ‘gene’
> summarizeDataFrame: no visible binding for global variable ‘value’
> ...
>
> Here is a short version of the function:
>
> #' Collapse rows with duplicated name column
> #'
> #' @param dat a \code{tibble} with the columns name, gene and value
> #' @importFrom plyr ddply
> #' @import tibble
> #' @return a \code{tibble}
> #' @export
> #'
> #' @examples
> #' dat <- tibble(name = c(paste0("position", 1:5), paste0("position", 1:3)),
> #'               gene = paste0("gene", 1:8), value = 1:8)
> #' summarizeDataFrame(dat)
> summarizeDataFrame <- function(dat) {
>   ddply(dat, "name", "summarize",
>         name  = unique(name),
>         gene  = paste(unique(gene), collapse = ","),
>         value = mean(value))
> }
>
> R interprets the "name", "gene" and "value" column names as variables
> during the check. Does anyone have an idea how to change the syntax of
> ddply or how to get rid of the warning messages?
>
> Thanks in advance!
>
> Tobias
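For completeness, the utils::globalVariables() route Mike describes is a single top-level call, conventionally placed in a package source file (a file name like zzz.R is a convention, not a requirement), outside any function:

```r
# Declare the column names that summarizeDataFrame resolves via
# non-standard evaluation, so R CMD check stops reporting
# "no visible binding for global variable" for them.
# Note the trade-off Martin mentions: this silences the NOTE for these
# symbols everywhere in the package, including genuine typos.
utils::globalVariables(c("name", "gene", "value"))
```

The call registers the names and invisibly returns the updated list of declared globals for the package.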
[Bioc-devel] ddply causes error during R check
Hi,

I am developing a Bioconductor package and cannot get rid of some warning messages. During devtools::check() I get the following warnings:

...
summarizeDataFrame: no visible binding for global variable ‘name’
summarizeDataFrame: no visible binding for global variable ‘gene’
summarizeDataFrame: no visible binding for global variable ‘value’
...

Here is a short version of the function:

#' Collapse rows with duplicated name column
#'
#' @param dat a \code{tibble} with the columns name, gene and value
#' @importFrom plyr ddply
#' @import tibble
#' @return a \code{tibble}
#' @export
#'
#' @examples
#' dat <- tibble(name = c(paste0("position", 1:5), paste0("position", 1:3)),
#'               gene = paste0("gene", 1:8), value = 1:8)
#' summarizeDataFrame(dat)
summarizeDataFrame <- function(dat) {
  ddply(dat, "name", "summarize",
        name  = unique(name),
        gene  = paste(unique(gene), collapse = ","),
        value = mean(value))
}

R interprets the "name", "gene" and "value" column names as variables during the check. Does anyone have an idea how to change the syntax of ddply or how to get rid of the warning messages?

Thanks in advance!

Tobias