Re: [Bioc-devel] Removal of Information in OrgDb generated from NCBI -- Feedback needed.

2020-04-30 Thread web working

Hi Lori,

I agree with Michael. I myself do not need the UNIGENE field. The 
most frequent columns that I use are:


ENTREZID, REFSEQ, ENSEMBL, SYMBOL, GENENAME, UNIPROT, gene_id and GO.

Best,

Tobias

On 2020-04-30 at 08:46, Stadler, Michael wrote:

Hi Lori

Just my two cents: I would not miss UNIGENE.

I am using org.db's mostly to annotate primary gene identifiers (ENTREZID, 
ENSEMBL) with additional human-readable information (SYMBOL, GENENAME), 
and to map between different primary identifiers. I mostly use, in 
decreasing order of importance:
org.Hs.eg.db   org.Mm.eg.db   org.Rn.eg.db   org.Ce.eg.db   org.Dm.eg.db   
org.Sc.sgd.db

And the most frequent columns that I use are (again, decreasing order of 
importance):
ENTREZID ENSEMBL SYMBOL GENENAME UNIPROT GO

Maybe there is some indirect other usage that I am not aware of.
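The annotation pattern described above maps onto a single select() call; here is a minimal sketch (not from the original mail; it assumes org.Hs.eg.db is installed, and the keys are arbitrary example Entrez IDs):

```r
## Sketch of the typical OrgDb lookup described above.
suppressPackageStartupMessages({
  library(AnnotationDbi)
  library(org.Hs.eg.db)
})

## Annotate primary identifiers (ENTREZID) with human-readable columns
AnnotationDbi::select(org.Hs.eg.db,
                      keys    = c("1", "10"),
                      columns = c("SYMBOL", "GENENAME", "ENSEMBL"),
                      keytype = "ENTREZID")

## All columns this OrgDb offers (UNIGENE appears here if present)
columns(org.Hs.eg.db)
```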

Best wishes,
Michael

-Original Message-
From: Bioc-devel  On Behalf Of Shepherd, Lori
Sent: Wednesday, 29 April 2020 19:50
To: Bioc-devel@r-project.org
Subject: [Bioc-devel] Removal of Information in OrgDb generated from NCBI -- 
Feedback needed.

Hello Bioconductor maintainers

The core team was made aware of an issue with one of the make orgDb functions 
in AnnotationForge:


https://github.com/Bioconductor/AnnotationForge/issues/13


Investigating further, we found that NCBI will no longer update the 
gene2unigene file. The URL has moved to an ARCHIVE directory; an explanation 
and notice of retirement can be found here:


ftp://ftp.ncbi.nih.gov/repository/UniGene/README



We use this function to create OrgDb's for the AnnotationHub, and it is also 
the recommended way for users to make custom OrgDb's from NCBI.

As a temporary fix we are updating the URL to the new location, but we are 
considering removing the gene2unigene data from the OrgDbs. We would like to 
ask the community, especially those who use the OrgDb objects frequently: is 
this data still necessary, and would the removal of UNIGENE cause a large 
disruption to current packages/functions/uses of these objects?

Any feedback is greatly appreciated.

Thank you



Lori Shepherd

Bioconductor Core Team

Roswell Park Comprehensive Cancer Center

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



Re: [Bioc-devel] How to speed up GRange comparison

2020-01-29 Thread web working
Hi Herve,

Thank you for your answer. pcompare() works fine for me. Here is my solution:

query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)))
subject <- GRanges(rep("chr1",4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)))
out <- vector("numeric", length(query))
out[(which(abs(pcompare(query, subject))<5))] <- 1
out
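The codes returned by pcompare() can be decoded as Herve suggests; a small sketch (per the ?rangeComparisonCodeToLetter man page, codes -4..4 denote overlapping ranges, which is what the abs(code) < 5 test above relies on):

```r
## Sketch: decoding pcompare() codes (see ?rangeComparisonCodeToLetter).
library(GenomicRanges)
query   <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)))
subject <- GRanges(rep("chr1", 4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)))

codes <- pcompare(query, subject)
rangeComparisonCodeToLetter(codes)  # letters a-m describe each relationship
as.integer(abs(codes) < 5)          # codes -4..4 = overlapping pairs
```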

Carey was right that this is off-list. Next time I will post my 
question on support.bioconductor.org.

Best,

Tobias

On 2020-01-29 at 18:02, Pages, Herve wrote:
> Yes poverlaps().
>
> Or pcompare(), which should be even faster. But only if you are not
> afraid to go low-level. See ?rangeComparisonCodeToLetter for the meaning
> of the codes returned by pcompare().
>
> H.
>
> On 1/29/20 08:01, Michael Lawrence via Bioc-devel wrote:
>> poverlaps()?
>>
>> On Wed, Jan 29, 2020 at 7:50 AM web working  wrote:
>>> Hello,
>>>
>>> I have two big GRanges objects and want to search for an overlap of the
>>> first range of query with the first range of subject, then take the
>>> second range of query and compare it with the second range of subject,
>>> and so on. Here is an example of my problem:
>>>
>>> # GRanges objects
>>> query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10,
>>> 22)), id=1:4)
>>> subject <- GRanges(rep("chr1",4), IRanges(c(3, 1, 1, 15), c(4, 2, 2,
>>> 21)), id=1:4)
>>>
>>> # The 2 overlaps at the first position should not be counted, because
>>> # these ranges are in different rows.
>>> countOverlaps(query, subject)
>>>
>>> # Approach 1 (bad style; simplified for clarity)
>>> dat <- as.data.frame(findOverlaps(query, subject))
>>> indexDat <- apply(dat, 1, function(x) x[1]==x[2])
>>> indexBool <- dat[indexDat,1]
>>> out <- rep(FALSE, length(query))
>>> out[indexBool] <- TRUE
>>> as.numeric(out)
>>>
>>> # Approach 2 (bad style and takes too long)
>>> out <- vector("numeric", 4)
>>> for(i in seq_along(query)) out[i] <- (overlapsAny(query[i], subject[i]))
>>> out
>>>
>>> # Approach 3 (wrong results)
>>> as.numeric(overlapsAny(query, subject))
>>> as.numeric(overlapsAny(split(query, 1:4), split(subject, 1:4)))
>>>
>>>
>>> Maybe someone has an idea to speed this up?
>>>
>>>
>>> Best,
>>>
>>> Tobias
>>>




[Bioc-devel] How to speed up GRange comparison

2020-01-29 Thread web working

Hello,

I have two big GRanges objects and want to search for an overlap of the 
first range of query with the first range of subject, then take the 
second range of query and compare it with the second range of subject, 
and so on. Here is an example of my problem:


# GRanges objects
query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 
22)), id=1:4)
subject <- GRanges(rep("chr1",4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 
21)), id=1:4)


# The 2 overlaps at the first position should not be counted, because 
# these ranges are in different rows.

countOverlaps(query, subject)

# Approach 1 (bad style; simplified for clarity)
dat <- as.data.frame(findOverlaps(query, subject))
indexDat <- apply(dat, 1, function(x) x[1]==x[2])
indexBool <- dat[indexDat,1]
out <- rep(FALSE, length(query))
out[indexBool] <- TRUE
as.numeric(out)

# Approach 2 (bad style and takes too long)
out <- vector("numeric", 4)
for(i in seq_along(query)) out[i] <- (overlapsAny(query[i], subject[i]))
out

# Approach 3 (wrong results)
as.numeric(overlapsAny(query, subject))
as.numeric(overlapsAny(split(query, 1:4), split(subject, 1:4)))


Maybe someone has an idea of how to speed this up?
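The replies in this thread point to poverlaps() for exactly this element-wise comparison; a minimal sketch:

```r
## Sketch of the poverlaps() answer given in this thread: it compares
## query[i] against subject[i] in parallel, one comparison per row.
library(GenomicRanges)
query   <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)))
subject <- GRanges(rep("chr1", 4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)))

as.integer(poverlaps(query, subject))  # only the 4th pair overlaps
```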


Best,

Tobias



Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages

2020-01-16 Thread web working

Hi Herve,

thank you for your answer. To be honest I am fine if the data sets can 
not be loaded with data() solution.


I incorrectly said that a warning occurred during BiocCheck; the warning 
actually occurred during the R CMD check of my data package. I used the 
recommended R CMD check environment flags for the check of my package 
(https://github.com/Bioconductor/Contributions/blob/master/CONTRIBUTING.md#r-cmd-check-environment) 
(devtools::check(document = FALSE, args = c('--no-build-vignettes'), 
build_args = c('--resave-data','--no-build-vignettes'))).


During the "Building" step of the R CMD check some "strange" behavior 
occurs: a lot of characters are printed to the screen:


...

─  checking for empty or unneeded directories

─  looking to see if a ‘data/datalist’ file should be added
 dispersionFunction
 fitType
 varLogDispEsts
 dispPriorVar
   dispFunction
   fit
   d
   means
   disps
   minDisp
   
 means
   class
   class
   class
   class
 rowRanges
 unlistData
 elementMetadata
 elementType
 metadata
 partitioning
 class
 colData
 rownames
 nrows
 listData
   [1]
   [2]
 elementType

...

At the "Checking" step I got a warning with the same effect:

...

* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... WARNING

[3]
[4]
[5]
  
    [6]
  
  names
  [2]
  [3]
  [4]
  [5]
    names
    dispersionFunction
    fitType
    varLogDispEsts
    dispPriorVar
    [1]
  [1]
    [1]
    [2]
  names
  class
  row.names
  [2]
    
  means
class

...

This warning only occurs if I store my RData files in the data directory.

Here is an example of the class object I store in the RData file:

#' @rdname dummyDataSet
setClass("dummyDataSet", slots = c(dds = "list",
   genes = "GenomicRanges",
   bamFiles = "list",
   resultTables = "list",
   treatment = "character",
   nameAnalysis = "character",
   numberOfCores = "numeric"),
 validity = function(object) {
   ...
 }, contains = "dummySoftware")

My class consists of some lists of data.frames (e.g. resultTables), lists 
of S4 objects (e.g. a list of DESeq2 objects (dds)), S4 objects (e.g. 
genes), and more. Maybe this is the reason for this "strange" behavior?


If you need some more information, just let me know.

Here is my sessionInfo:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /home/x/Programme/R_Versions/R-3.6.1/lib/libRblas.so
LAPACK: /home/x/Programme/R_Versions/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=de_DE.UTF-8   LC_NUMERIC=C LC_TIME=de_DE.UTF-8    
LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8 
LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C

[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods base

other attached packages:
[1] dummyData_0.1.0

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1    yaml_2.2.0

Best,

Tobias


On 2020-01-09 at 22:40, Pages, Herve wrote:

On 1/9/20 13:00, web working wrote:

Hi Herve,

thank you for your detailed answer. I guess I expressed myself
unclearly. The BED files were just examples of data I store in the
inst/extdata folder. Based on the description for ExperimentHubData I
have decided to create a software and a data package (not an
ExperimentHubData software package). In my RData files I store software
package objects. These objects are bigger than 5 MB. Using a helper
function is not an option, because the object calculation takes too much
time. For this reason I want to load these objects for my example
functions. My question is whether the storage of my RData files in the
inst/extdata directory is correct or not.

It's technically correct but it's not as convenient as putting them in
data/ because they can no longer be listed and/or loaded with data().
So if you're storing them in inst/extdata only because the data()
solution gave you a BiocCheck warning then I'd say that you're giving up
too easily ;-)

IMO it is important to try to understand why the data() solution gave
you a BiocCheck warning in the first place. Unfortunately you're not
providing enough information for us to be able to tell. What does the
warning say? How can we reproduce the warning? Ideally we would need to
see a transcript of your session and links to your packages.

Thanks,
H.



Best,

Tobias

Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages

2020-01-16 Thread web working
Thank you for your example, Kasper. The require() approach seems like an 
option for me. I am following the Bioconductor "Circular Dependencies" 
Guidelines 
(https://github.com/Bioconductor/Contributions/blob/master/CONTRIBUTING.md#submitting-related-packages) 
to implement my software and my data package, using the "Suggests" and 
"Depends" relationship.


On 2020-01-15 at 00:12, Kasper Daniel Hansen wrote:

Tobias,

When you use the data() command on the data package, you need to do
   library(dummyData)
first (and you therefore need to Suggest: dummyData)

Here is an example from minfi/minfiData

if (require(minfiData)) {
   dat <- preprocessIllumina(RGsetEx, bg.correct=FALSE, normalize="controls")
}

Note how I use require to load the package. For clarity you could argue I
should also have
   data(RGsetEx)
but it is technically not necessary because of lazy loading.





On Thu, Jan 9, 2020 at 4:40 PM Pages, Herve  wrote:


On 1/9/20 13:00, web working wrote:

Hi Herve,

thank you for your detailed answer. I guess I expressed myself
unclearly. The BED files were just examples of data I store in the
inst/extdata folder. Based on the description for ExperimentHubData I
have decided to create a software and a data package (not an
ExperimentHubData software package). In my RData files I store software
package objects. These objects are bigger than 5 MB. Using a helper
function is not an option, because the object calculation takes too much
time. For this reason I want to load these objects for my example
functions. My question is whether the storage of my RData files in the
inst/extdata directory is correct or not.

It's technically correct but it's not as convenient as putting them in
data/ because they can no longer be listed and/or loaded with data().
So if you're storing them in inst/extdata only because the data()
solution gave you a BiocCheck warning then I'd say that you're giving up
too easily ;-)

IMO it is important to try to understand why the data() solution gave
you a BiocCheck warning in the first place. Unfortunately you're not
providing enough information for us to be able to tell. What does the
warning say? How can we reproduce the warning? Ideally we would need to
see a transcript of your session and links to your packages.

Thanks,
H.



Best,

Tobias

On 2020-01-09 at 17:59, Pages, Herve wrote:

Hi Tobias,

If the original data is in BED files, there should be no need to
serialize the objects obtained by importing the files. It is **much**
better to provide a small helper function that creates an object from a
BED file and to use that function each time you need to load an object.

This has at least 2 advantages:
1. It avoids redundant storage of the data.
2. By avoiding serialization of high-level S4 objects, it makes the
package easier to maintain in the long run.

Note that the helper function could also implement a cache mechanism
(easy to do with an environment) so the BED file is only loaded and the
object created the 1st time the function is called. On subsequent calls,
the object is retrieved from the cache.
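A minimal sketch of the environment-based cache described here (the names .cache and loadBed are illustrative, not from the original mail; rtracklayer is assumed for the BED import):

```r
## Sketch of an environment-based cache: the BED file is imported only
## on the first call; later calls return the cached object.
.cache <- new.env(parent = emptyenv())

loadBed <- function(file) {
  if (!exists(file, envir = .cache)) {
    obj <- rtracklayer::import(file, format = "BED")  # expensive, done once
    assign(file, obj, envir = .cache)
  }
  get(file, envir = .cache)
}
```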

However, if the BED files are really big (e.g. > 50 Mb), we require them
to be stored on ExperimentHub instead of inside dummyData. Note that you
still need to provide the dummyData package (which becomes what we call
an ExperimentHub-based data package). See the "Creating An ExperimentHub
Package" vignette in the ExperimentHubData package for more information
about this.

Hope this helps,

H.

On 1/9/20 04:45, web working wrote:

Dear all,

I am currently developing a software package (dummySoftware) and a data
package (dummyData) and I am a bit confused in where to store my RData
files in the data package. Here my situation:

I want to store some software package objects (new class objects of the
software package) in the data package. These objects are example objects
and are too big for software packages. As I have read here
(

http://r-pkgs.had.co.nz/data.html

) all RData objects should be stored in the data directory of a

package.

BED files of the data package are stored in inst/extdata.
The data of the data package will be accessed in the software package
like this: system.file('extdata', 'subset.bed', package = 'dummyData').
And here the problem occurs. After building the data package
(devtools::build(args = c('--resave-data'))), all data in data/ are
stored in a datalist, Rdata.rdb, Rdata.rds and Rdata.rdx and cannot be
accessed with system.file(). Accessing this data with the data()
function results in a warning during BiocCheck::BiocCheck().

My solution is to store the RData files in the inst/extdata directory
and address them with system.file. Something similar is mentioned here,
but in the context of a vignette

Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages

2020-01-09 Thread web working
Hi Richard,

It depends on the file type. I am loading my non-RData files with 
read.delim() and my RData files with a helper function which returns the 
R object stored in the RData file:

#' Load RData object and returns first entry
#'
#' Load RData object and returns first entry. If there is more than one
#' object in the RData file, only the first object will be returned.
#'
#' @param RDataFile a \code{character} vector to the RData file.
#'
#' @return The \code{R} object of the RData file.
#' @export
#' @examples
#' # load GRanges object stored in a RData file.
#' dummy.GRanges <- loadRData(system.file('extdata', 'dummy.RData',
#'                                        package = "dummyData"))
loadRData <- function(RDataFile) {
   load(RDataFile)
   objectToLoad <- ls()[ls() != "RDataFile"]
   if (length(objectToLoad) > 1)
     warning(paste0("RData file contains more than one object. Only the ",
                    "first object (", objectToLoad[1],
                    ") will be returned!"))
   get(objectToLoad[1])
}

I know this is not the best solution. I guess saving the R objects in 
RDS files instead of RData files would be the better solution here.
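A minimal sketch of that RDS alternative (the paths and the package name dummyData are illustrative): saveRDS() stores exactly one object, so readRDS() returns it directly, without the ls()/get() workaround above:

```r
## Sketch: one object per .rds file, read back by value.
x <- data.frame(id = 1:3, score = c(0.1, 0.5, 0.9))  # any R object
saveRDS(x, file = "inst/extdata/dummy.rds")          # while building the package

## Later, e.g. in examples, locate and read it via system.file():
x <- readRDS(system.file("extdata", "dummy.rds", package = "dummyData"))
```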

My question is whether storing the RData objects (or RDS objects) in 
the inst/extdata directory is OK for a Bioconductor package.

Best,

Tobias


On 2020-01-09 at 15:45, Richard Virgen-Slane wrote:
>
> I may be missing a point, but how are you loading the saved files?
>
> On Thu, Jan 9, 2020 at 4:46 AM web working <webwork...@posteo.de> wrote:
>
> Dear all,
>
> I am currently developing a software package (dummySoftware) and a
> data package (dummyData) and I am a bit confused in where to store
> my RData files in the data package. Here my situation:
>
> I want to store some software package objects (new class objects
> of the software package) in the data package. These objects are
> example objects and are too big for software packages. As I have read
> here (http://r-pkgs.had.co.nz/data.html) all RData objects should
> be stored in the data directory of a package. BED files of the
> data package are stored in inst/extdata.
> The data of the data package will be accessed in the software
> package like this: system.file('extdata', 'subset.bed', package =
> 'dummyData'). And here the problem occurs. After building the data
> package (devtools::build(args = c('--resave-data'))), all data in
> data/ are stored in a datalist, Rdata.rdb, Rdata.rds and Rdata.rdx
> and cannot be accessed with system.file(). Accessing this data with
> the data() function results in a warning during
> BiocCheck::BiocCheck().
>
> My solution is to store the RData files in the inst/extdata
> directory and access them with system.file(). Something similar is
> mentioned here, but in the context of a vignette
> (http://r-pkgs.had.co.nz/data.html#other-data). Is this the way
> to do it?
>
> Best,
> Tobias
>




Re: [Bioc-devel] How to use RData files in Bioconductor data and software packages

2020-01-09 Thread web working

Hi Herve,

thank you for your detailed answer. I guess I expressed myself 
unclearly. The BED files were just examples of data I store in the 
inst/extdata folder. Based on the description for ExperimentHubData I 
have decided to create a software and a data package (not an 
ExperimentHubData software package). In my RData files I store software 
package objects. These objects are bigger than 5 MB. Using a helper 
function is not an option, because the object calculation takes too much 
time. For this reason I want to load these objects for my example 
functions. My question is whether the storage of my RData files in the 
inst/extdata directory is correct or not.


Best,

Tobias

On 2020-01-09 at 17:59, Pages, Herve wrote:

Hi Tobias,

If the original data is in BED files, there should be no need to
serialize the objects obtained by importing the files. It is **much**
better to provide a small helper function that creates an object from a
BED file and to use that function each time you need to load an object.

This has at least 2 advantages:
1. It avoids redundant storage of the data.
2. By avoiding serialization of high-level S4 objects, it makes the
package easier to maintain in the long run.

Note that the helper function could also implement a cache mechanism
(easy to do with an environment) so the BED file is only loaded and the
object created the 1st time the function is called. On subsequent calls,
the object is retrieved from the cache.

However, if the BED files are really big (e.g. > 50 Mb), we require them
to be stored on ExperimentHub instead of inside dummyData. Note that you
still need to provide the dummyData package (which becomes what we call
an ExperimentHub-based data package). See the "Creating An ExperimentHub
Package" vignette in the ExperimentHubData package for more information
about this.

Hope this helps,

H.

On 1/9/20 04:45, web working wrote:

Dear all,

I am currently developing a software package (dummySoftware) and a data
package (dummyData) and I am a bit confused in where to store my RData
files in the data package. Here my situation:

I want to store some software package objects (new class objects of the
software package) in the data package. These objects are example objects
and are too big for software packages. As I have read here
(http://r-pkgs.had.co.nz/data.html
) all RData objects should be stored in the data directory of a package.
BED files of the data package are stored in inst/extdata.
The data of the data package will be accessed in the software package
like this: system.file('extdata', 'subset.bed', package = 'dummyData').
And here the problem occurs. After building the data package
(devtools::build(args = c('--resave-data'))), all data in data/ are
stored in a datalist, Rdata.rdb, Rdata.rds and Rdata.rdx and cannot be
accessed with system.file(). Accessing this data with the data()
function results in a warning during BiocCheck::BiocCheck().

My solution is to store the RData files in the inst/extdata directory
and access them with system.file(). Something similar is mentioned here,
but in the context of a vignette
(http://r-pkgs.had.co.nz/data.html#other-data). Is this the way to do it?

Best,
Tobias






[Bioc-devel] How to use RData files in Bioconductor data and software packages

2020-01-09 Thread web working

Dear all,

I am currently developing a software package (dummySoftware) and a data package 
(dummyData) and I am a bit confused in where to store my RData files in the 
data package. Here my situation:

I want to store some software package objects (new class objects of the 
software package) in the data package. These objects are example objects 
and are too big for software packages. As I have read here 
(http://r-pkgs.had.co.nz/data.html) all RData objects should be stored in the 
data directory of a package. BED files of the data package are stored in 
inst/extdata.
The data of the data package will be accessed in the software package like 
this: system.file('extdata', 'subset.bed', package = 'dummyData'). And here the 
problem occurs. After building the data package (devtools::build(args = 
c('--resave-data'))), all data in data/ are stored in a datalist, Rdata.rdb, 
Rdata.rds and Rdata.rdx and cannot be accessed with system.file(). Accessing 
this data with the data() function results in a warning during 
BiocCheck::BiocCheck().

My solution is to store the RData files in the inst/extdata directory and 
access them with system.file(). Something similar is mentioned here, but in the 
context of a vignette (http://r-pkgs.had.co.nz/data.html#other-data). Is this 
the way to do it?

Best,
Tobias



[Bioc-devel] Roxygen

2019-05-09 Thread web working

Hi all,

I am struggling a bit with roxygen function imports. Do you know how 
to add the dplyr %>% functionality to a function without importing the 
whole dplyr package?


This is my current solution:

#' Return first 10 entries
#'
#' @param mtcars a \code{data.frame} object
#'
#' @import dplyr
#' @return a \code{data.frame} with first 10 entries
#' @export
#'
#' @examples
#' fancyFunction(mtcars)
fancyFunction <- function(mtcars){
  output <- mtcars %>% head(10)
  return(output)
}


This is how I would like to have it:

#' Return first 10 entries
#'
#' @param mtcars a \code{data.frame} object
#'
#' @importFrom dplyr %>%
#' @return a \code{data.frame} with first 10 entries
#' @export
#'
#' @examples
#' fancyFunction(mtcars)
fancyFunction <- function(mtcars){
  output <- mtcars %>% head(10)
  return(output)
}
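No answer is quoted in this post, but a common approach (my suggestion, not from the original mail) is to import the pipe from magrittr, the package that dplyr itself re-exports it from:

```r
## Sketch: import only the pipe, not the whole dplyr namespace.
## %>% originates in magrittr; dplyr merely re-exports it.

#' Return first 10 entries
#'
#' @param mtcars a \code{data.frame} object
#'
#' @importFrom magrittr %>%
#' @return a \code{data.frame} with the first 10 entries
#' @export
#'
#' @examples
#' fancyFunction(mtcars)
fancyFunction <- function(mtcars) {
  mtcars %>% head(10)
}
```

usethis::use_pipe() automates this setup (it adds magrittr to Imports and creates the @importFrom roxygen tag for you).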


Do you have any idea how to deal with this problem?

Best

Tobias



Re: [Bioc-devel] problem with documentation of setMethod with different signatures

2019-02-18 Thread web working

Hi Martin,

thank you very much for this fast and detailed answer. Your solution 
works great for me. I agree with your point that roxygen requires a lot 
of manual effort to arrive at a satisfactory solution, especially when 
you use it for the first time.

Tobias

On 2019-02-14 at 12:07, Martin Morgan wrote:

It's important to distinguish between problems in your code and problems in the 
tools (or use of tools) to process the code. Here the use of setGeneric() / 
setMethod() is correct. Your use of roxygen2 is getting in the way.

You've used @rdname to document the methods on the same page (reducing the 
number of man pages probably helps the user, so this is good). But then on the 
same page you've documented `@param x`, for instance, twice, once for GRanges 
and once for integer. Here the solution is to document `@param x` in only one 
place, e.g., in the generic, with sentences that describe appropriate input for 
each method.

#' @param x A `GRanges` instance or `integer` vector that...

For other tags, e.g., `@details`, `@return`, multiple uses are concatenated 
into paragraphs in a single element in the Rd file; one might then say

#' @details methodA,GRanges-method does one thing.
...
#' @details methodA,integer-method does another.

Generally, it is much harder to provide clear user-oriented documentation for 
S4 classes and methods, and roxygen requires a lot of manual effort to arrive 
at a satisfactory solution. You've already started down that road by using the 
@rdname tag to group methods documentation on the same page.

While open for discussion and certainly dependent on context, I think it is 
more helpful to group documentation by class rather than generic, e.g, 'I have 
a GRanges object, what can I do with it?' rather than 'I wonder what I can find 
the `start()` of?'. Also the best documentation pages are really very carefully 
constructed, and these are very difficult to generate automatically from 
roxygen snippets associated with individual functions / methods; an approach 
might be to provide a single roxygen chunk at the top of a source file that 
documents the content of the source file, with individual methods etc 
restricted to tags `@rdname` and `@export`. The old school approach is to 
simply edit the man page by hand directly.

Finally, it has proven very helpful to organize code, man pages, and unit tests 
in a parallel fashion

R/My-class.R
   man/My-class.Rd
   tests/testthat/test_My-class.R

The GenomicRanges package https://github.com/Bioconductor/GenomicRanges might 
be an advanced example of structure, though based on old-school manual 
construction of Rd files.

Martin

On 2/14/19, 4:08 AM, "Bioc-devel on behalf of web working" 
 wrote:

 Hi,
 
 I am struggling a bit with a R generic function. I build a generic and

 set two implementations of the generic with two different signatures as
 input. Both implementations of the generic produce the same output but
 have a different input. During devtools::check() I get the following error:
 
 ❯ checking Rd \usage sections ... WARNING

Duplicated \argument entries in documentation object 'methodA':
  ‘x’ ‘size’
 
   Functions with \usage entries need to have the appropriate \alias

entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
 
 
 The original functions are complex so here are some dummy methods:
 
 
 #' methodA methods generic

 #' @rdname methodA-methods
 #' @export
 #'
 setGeneric("methodA", function(x, size = 1000)
standardGeneric("methodA"))
 
 
 #' methodA method for \code{GRanges} input

 #'
 #' @param x a \code{GRanges} object
 #' @param size a \code{numeric} vector
 #'
 #' @import GenomicRanges
 #' @return a \code{list} object
 #' @rdname methodA-methods
 #' @export
 #' @examples
 #' library(GenomicRanges)
 #' dat.GRanges <- GRanges(seqnames=c(rep("chr1", 5), rep("chr2", 5)),
 #' IRanges(start = rep(c(1, 1, 55000, 55000, 15), 2),
 #' end = rep(c(2, 2, 7, 7, 60), 2)))
 #' out.list <- methodA(x = dat.GRanges, size = length(dat.GRanges))
 #'
 setMethod(methodA, signature(x="GRanges"),
function(x, size = 1000){
  s <- start(x)
  return(list(s, size))
})
 
 
 
 #' methodA method for named \code{integer} input

 #'
 #' @param x a \code{integer} vector
 #' @param size a \code{numeric} vector
 #'
 #' @return a \code{list} object
 #' @export
 #' @rdname methodA-methods
 #' @examples
 #' dat <- 1:20
 #' out.list <- method

[Bioc-devel] problem with documentation of setMethod with different signatures

2019-02-14 Thread web working

Hi,

I am struggling a bit with an R generic function. I built a generic and 
set up two implementations of the generic with two different signatures as 
input. Both implementations of the generic produce the same output but 
have different inputs. During devtools::check() I get the following error:


❯ checking Rd \usage sections ... WARNING
  Duplicated \argument entries in documentation object 'methodA':
    ‘x’ ‘size’

 Functions with \usage entries need to have the appropriate \alias
  entries, and all their arguments documented.
  The \usage entries must correspond to syntactically valid R code.
  See chapter ‘Writing R documentation files’ in the ‘Writing R
  Extensions’ manual.


The original functions are complex so here are some dummy methods:


#' methodA methods generic
#' @rdname methodA-methods
#' @export
#'
setGeneric("methodA", function(x, size = 1000)
  standardGeneric("methodA"))


#' methodA method for \code{GRanges} input
#'
#' @param x a \code{GRanges} object
#' @param size a \code{numeric} vector
#'
#' @import GenomicRanges
#' @return a \code{list} object
#' @rdname methodA-methods
#' @export
#' @examples
#' library(GenomicRanges)
#' dat.GRanges <- GRanges(seqnames=c(rep("chr1", 5), rep("chr2", 5)),
#' IRanges(start = rep(c(1, 1, 55000, 55000, 15), 2),
#' end = rep(c(2, 2, 7, 7, 60), 2)))
#' out.list <- methodA(x = dat.GRanges, size = length(dat.GRanges))
#'
setMethod(methodA, signature(x="GRanges"),
  function(x, size = 1000){
    s <- start(x)
    return(list(s, size))
  })



#' methodA method for named \code{integer} input
#'
#' @param x an \code{integer} vector
#' @param size a \code{numeric} vector
#'
#' @return a \code{list} object
#' @export
#' @rdname methodA-methods
#' @examples
#' dat <- 1:20
#' out.list <- methodA(x = dat, size = length(dat))
setMethod(methodA, signature(x="integer"),
  function(x, size = 1000){
    return(list(x, size))
  })

The warning above is perfectly understandable to me, but I do not have 
a solution for it. Maybe using a generic is not the way to do this here?
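One common way around this (a sketch, not part of the original message; assumes roxygen2) is to document each argument exactly once, on the generic, so that the shared Rd page ends up with a single \argument entry per name:

```r
#' methodA methods generic
#'
#' @param x a \code{GRanges} object or an \code{integer} vector
#' @param size a \code{numeric} vector
#' @rdname methodA-methods
#' @export
setGeneric("methodA", function(x, size = 1000)
  standardGeneric("methodA"))

#' @rdname methodA-methods
#' @export
setMethod("methodA", signature(x = "GRanges"),
  function(x, size = 1000) list(start(x), size))

#' @rdname methodA-methods
#' @export
setMethod("methodA", signature(x = "integer"),
  function(x, size = 1000) list(x, size))
```

With the @param tags only on the generic, all three blocks merge into one methodA-methods.Rd and the "Duplicated \argument entries" warning goes away.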


Tobias

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] ddply causes error during R check

2019-02-14 Thread web working

Hi Martin,

thank you for this approach. I will check my code and see where I can 
use it.


Tobias

On 12.02.19 at 14:58, Martin Morgan wrote:

Use `globalVariables()` to declare these symbols and quieten the warnings, at the 
expense of quietening warnings about undefined variables in _all_ code and 
potentially silencing true positives. Alternatively, avoid non-standard evaluation 
(this is what ddply is doing, using special rules to resolve symbols like `name`) 
by using base R functionality; note also that non-standard evaluation is prone to 
typos, e.g., looking for the typo `hpx` in the calling environment rather than 
the data frame:


hpx = 1
ddply(mtcars, "cyl", "summarize", value = mean(hpx))  ## oops, meant `mean(hp)`

  cyl value
1   4     1
2   6     1
3   8     1

Marginally better is


aggregate(hp ~ cyl, mtcars, mean)

  cyl        hp
1   4  82.63636
2   6 122.28571
3   8 209.21429

where R recognizes symbols in the formula ~ as intentionally unresolved. The 
wizards on the list might point to constructs in the rlang package.
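Applied to the summarizeDataFrame example from the question, the base-R route might look like this (a sketch, not from the thread; aggregate() and merge() stand in for the ddply call):

```r
# Symbols appear only inside formulas, so R CMD check raises no
# "no visible binding for global variable" notes.
summarizeDataFrame <- function(dat) {
  genes  <- aggregate(gene ~ name, dat,
                      function(g) paste(unique(g), collapse = ","))
  values <- aggregate(value ~ name, dat, mean)
  merge(genes, values, by = "name")
}

dat <- data.frame(name  = c(paste0("position", 1:5), paste0("position", 1:3)),
                  gene  = paste0("gene", 1:8),
                  value = 1:8)
summarizeDataFrame(dat)
```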

Martin

On 2/12/19, 2:35 AM, "Bioc-devel on behalf of web working" 
 wrote:

 Hi,
 
 I am developing a Bioconductor package and cannot get rid of some
 warning messages. During devtools::check() I get the following warning
 messages:
 
 ...

 summarizeDataFrame: no visible binding for global variable ‘name’
 summarizeDataFrame: no visible binding for global variable ‘gene’
 summarizeDataFrame: no visible binding for global variable ‘value’
 ...
 
 Here a short version of the function:
 
 #' Collapse rows with duplicated name column
 #'
 #' @param dat a \code{tibble} with the columns name, gene and value
 #' @importFrom plyr ddply
 #' @import tibble
 #' @return a \code{tibble}
 #' @export
 #'
 #' @examples
 #' dat <- tibble(name = c(paste0("position", 1:5), paste0("position", 1:3)),
 #'               gene = paste0("gene", 1:8), value = 1:8)
 #' summarizeDataFrame(dat)
 summarizeDataFrame <- function(dat){
   ddply(dat, "name", "summarize",
         name = unique(name),
         gene = paste(unique(gene), collapse = ","),
         value = mean(value))
 }
 
 R interprets the "name", "gene" and "value" column names as variables
 during the check. Does anyone have an idea how to change the syntax of
 ddply or how to get rid of the warning messages?
 
 Thanks in advance!
 
 Tobias
 
 ___

 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] ddply causes error during R check

2019-02-14 Thread web working
Hi Mike,

thank you for pointing out that there are other package which have the 
same situation.

Tobias
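For reference, Mike's globalVariables() suggestion would look something like this (a sketch; the call goes at the top level of a package source file, e.g. a hypothetical R/globals.R, not inside a function):

```r
# Declare the symbols that ddply resolves via non-standard evaluation,
# so R CMD check stops reporting "no visible binding for global variable".
utils::globalVariables(c("name", "gene", "value"))
```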

On 12.02.19 at 14:47, Mike Smith wrote:
> If you're sure these are false positives (and it looks like they are) 
> then you can use utils::globalVariables() outside of your function to 
> get rid of the note.  It might also be worth pointing out that there 
> are also plenty of Bioconductor packages that don't do this and simply 
> have this mentioned in the check results, e.g. 
> http://bioconductor.org/checkResults/devel/bioc-LATEST/beadarray/malbec2-checksrc.html
>
> Mike
>
> On Tue, 12 Feb 2019 at 08:35, web working <webwork...@posteo.de> wrote:
>
> Hi,
>
> I am developing a Bioconductor package and cannot get rid of some
> warning messages. During devtools::check() I get the following
> warning
> messages:
>
> ...
> summarizeDataFrame: no visible binding for global variable ‘name’
> summarizeDataFrame: no visible binding for global variable ‘gene’
> summarizeDataFrame: no visible binding for global variable ‘value’
> ...
>
> Here a short version of the function:
>
> #' Collapse rows with duplicated name column
> #'
> #' @param dat a \code{tibble} with the columns name, gene and value
> #' @importFrom plyr ddply
> #' @import tibble
> #' @return a \code{tibble}
> #' @export
> #'
> #' @examples
> #' dat <- tibble(name = c(paste0("position", 1:5), paste0("position", 1:3)),
> #'               gene = paste0("gene", 1:8), value = 1:8)
> #' summarizeDataFrame(dat)
> summarizeDataFrame <- function(dat){
>    ddply(dat, "name", "summarize",
>          name = unique(name),
>          gene = paste(unique(gene), collapse = ","),
>          value = mean(value))
> }
>
> R interprets the "name", "gene" and "value" column names as variables
> during the check. Does anyone have an idea how to change the syntax of
> ddply or how to get rid of the warning messages?
>
> Thanks in advance!
>
> Tobias
>
> ___
> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing
> list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] ddply causes error during R check

2019-02-11 Thread web working

Hi,

I am developing a Bioconductor package and cannot get rid of some 
warning messages. During devtools::check() I get the following warning 
messages:


...
summarizeDataFrame: no visible binding for global variable ‘name’
summarizeDataFrame: no visible binding for global variable ‘gene’
summarizeDataFrame: no visible binding for global variable ‘value’
...

Here a short version of the function:

#' Collapse rows with duplicated name column
#'
#' @param dat a \code{tibble} with the columns name, gene and value
#' @importFrom plyr ddply
#' @import tibble
#' @return a \code{tibble}
#' @export
#'
#' @examples
#' dat <- tibble(name = c(paste0("position", 1:5), paste0("position", 1:3)),
#'               gene = paste0("gene", 1:8), value = 1:8)
#' summarizeDataFrame(dat)
summarizeDataFrame <- function(dat){
  ddply(dat, "name", "summarize",
        name = unique(name),
        gene = paste(unique(gene), collapse = ","),
        value = mean(value))
}

R interprets the "name", "gene" and "value" column names as variables 
during the check. Does anyone have an idea how to change the syntax of 
ddply or how to get rid of the warning messages?


Thanks in advance!

Tobias

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel