Re: [Bioc-devel] read_bed()

Bhagwat, Aditya Wed, 18 Sep 2019 04:58:52 -0700

Hi Michael, 

That's a software design dilemma I've encountered a few times.


One approach is to keep the "verb" functions bare. E.g. read_bed would only 
read a bedfile, and plot_bed would somehow plot it. Advantage: if read_bed 
doesn't depend on anything else, other functions can depend on it, which makes 
dependency handling easier.

Another intention is to make verb functions "intuitive". In that scenario, I 
try for each operation to also output a visual image of the operation, to make 
it easier to see at a glance what is happening. E.g. for the range operations 
in multicrispr, the function plot_intervals visually shows what operation is 
being performed, making it easier to both spot errors as well as maintain focus.

In the case of read_bed, I thought of wrapping around your excellent core-level 
rtracklayer::import(), additionally providing the textual and visual feedback 
which I intent to give.

Interesting to hear your suggestions on this topic, though.

Aditya


________________________________________
From: Michael Lawrence [lawrence.mich...@gene.com]
Sent: Wednesday, September 18, 2019 1:33 PM
To: Bhagwat, Aditya
Cc: Michael Lawrence; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] read_bed()

I'm not sure if a function called read_bed() should be plotting or
printing. Is your BED file a known BED variant, i.e., maybe there is a
better name for the file type than "bed"?


On Wed, Sep 18, 2019 at 3:17 AM Bhagwat, Aditya
<aditya.bhag...@mpi-bn.mpg.de> wrote:
>
> Actually,
>
> I will keep multicrispr::read_bed(), but wrap it around 
> rtracklayer::import.bed, and additionally plot and print range summaries.
>
> Aditya
>
> ________________________________________
> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Bhagwat, 
> Aditya [aditya.bhag...@mpi-bn.mpg.de]
> Sent: Wednesday, September 18, 2019 11:31 AM
> To: Michael Lawrence
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] read_bed()
>
> (Typo corrected to avoid confusion)
>
> Michael,
>
> rtracklayer::import.bed() indeed works perfectly for me, so I am dropping 
> multicrispr::read_bed().
>
> In order to avoid the overkill of `require(tracklayer)` for multicrispr 
> <https://gitlab.gwdg.de/loosolab/software/multicrispr> users, does it make 
> sense to import/re-export import.bed() in multicrispr? What is BioC 
> convention/best practice in such cases?
>
> Aditya
>
>
>
> ________________________________________
> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Bhagwat, 
> Aditya [aditya.bhag...@mpi-bn.mpg.de]
> Sent: Wednesday, September 18, 2019 8:35 AM
> To: Michael Lawrence
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] read_bed()
>
> Thank you Michael :-)
>
> Aditya
> ________________________________________
> From: Michael Lawrence [lawrence.mich...@gene.com]
> Sent: Tuesday, September 17, 2019 8:49 PM
> To: Bhagwat, Aditya
> Cc: Michael Lawrence; bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] read_bed()
>
> I think you probably made a mistake when dropping the columns. When I
> provide the extraCols= argument (inventing my own names for things),
> it almost works, but it breaks due to NAs in the extra columns. The
> "." character is the standard way to express NA in BED files. I've
> added support for extra na.strings to version 1.45.6.
>
> For reference, the call is like:
>
> import("SRF.bed", extraCols=c(chr2="character", start2="integer",
> end2="integer", mDux="factor", type="factor", pos1="integer",
> pos2="integer", strand2="factor", from="factor", n="integer",
> code="character", anno="factor", id="character", biotype="character",
> score2="numeric" ), na.strings="NA")
>
>
> On Tue, Sep 17, 2019 at 7:23 AM Bhagwat, Aditya
> <aditya.bhag...@mpi-bn.mpg.de> wrote:
> >
> > Hi Michael,
> >
> > I removed the additional metadata columns in SRF.bed
> > https://gitlab.gwdg.de/loosolab/software/multicrispr/blob/master/inst/extdata/SRF.bed
> >
> > But still can't get rtracklayer::import.bed working:
> >
> > > rtracklayer::import.bed(bedfile)
> > Error in scan(file = file, what = what, sep = sep, quote = quote, dec = 
> > dec, : scan() expected 'a real', got '1.168.595'
> > > bedfile
> > [1] 
> > "C:/Users/abhagwa/Documents/R/R-3.6.1/library/multicrispr/extdata/SRF.bed"
> >
> > Never mind, multicrispr function read_bed, based on data.table::fread is 
> > doing the job, so I will stick to that .
> >
> > Thank you for all feedback,
> >
> > Cheers,
> >
> > Aditya
> >
> >
> > ________________________________________
> > From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Bhagwat, 
> > Aditya [aditya.bhag...@mpi-bn.mpg.de]
> > Sent: Tuesday, September 17, 2019 2:48 PM
> > To: Michael Lawrence
> > Cc: bioc-devel@r-project.org
> > Subject: Re: [Bioc-devel] read_bed()
> >
> > Oh :-) - Thankyou for explaining!
> > ________________________________________
> > From: Michael Lawrence [lawrence.mich...@gene.com]
> > Sent: Tuesday, September 17, 2019 2:40 PM
> > To: Bhagwat, Aditya
> > Cc: Michael Lawrence; Shepherd, Lori; bioc-devel@r-project.org
> > Subject: Re: [Bioc-devel] read_bed()
> >
> > Having a "." in the function name does not make something "S3".
> > There's no dispatch from import() to import.bed(). Had I not been a
> > total newb when I created rtracklayer, I would have called the
> > function importBed() or something like that. Sorry for the confusion.
> >
> > On Tue, Sep 17, 2019 at 5:34 AM Bhagwat, Aditya
> > <aditya.bhag...@mpi-bn.mpg.de> wrote:
> > >
> > > Oh, superb, thx!
> > >
> > > Interesting ... here you use S3 rather than S4 - I wonder the design 
> > > intention underlying these choices (I'm asking because I am trying to 
> > > figure out myself when to use S3 and when to use S4 and whether to mix 
> > > the two).
> > >
> > > Aditya
> > >
> > > ________________________________________
> > > From: Michael Lawrence [lawrence.mich...@gene.com]
> > > Sent: Tuesday, September 17, 2019 2:23 PM
> > > To: Bhagwat, Aditya
> > > Cc: Michael Lawrence; Shepherd, Lori; bioc-devel@r-project.org
> > > Subject: Re: [Bioc-devel] read_bed()
> > >
> > > The generic documentation does not mention it, but see ?import.bed.
> > > It's similar to colClasses on read.table().
> > >
> > > On Tue, Sep 17, 2019 at 5:15 AM Bhagwat, Aditya
> > > <aditya.bhag...@mpi-bn.mpg.de> wrote:
> > > >
> > > > Thankyou Michael,
> > > >
> > > > How do I use the extraCols argument? The documentation does not mention 
> > > > an `extraCols` argument explicitly, so it must be one of the ellipsis 
> > > > arguments, but `?rtracklayer::import` does not mention it. Should I say 
> > > > extraCols = 10 (ten extra columns) or so?
> > > >
> > > > Aditya
> > > >
> > > > ________________________________________
> > > > From: Michael Lawrence [lawrence.mich...@gene.com]
> > > > Sent: Tuesday, September 17, 2019 2:05 PM
> > > > To: Bhagwat, Aditya
> > > > Cc: Michael Lawrence; Shepherd, Lori; bioc-devel@r-project.org
> > > > Subject: Re: [Bioc-devel] read_bed()
> > > >
> > > > It breaks it because it's not standard BED; however, using the
> > > > extraCols= argument should work in this case. Requiring an explicit
> > > > format specification is intentional, because it provides validation
> > > > and type safety, and it communicates the format to a future reader.
> > > > This also looks a bit like a bedPE file, so you might consider using
> > > > the Pairs data structure.
> > > >
> > > > Michael
> > > >
> > > > On Tue, Sep 17, 2019 at 4:51 AM Bhagwat, Aditya
> > > > <aditya.bhag...@mpi-bn.mpg.de> wrote:
> > > > >
> > > > > Hi Michael,
> > > > >
> > > > > Yeah, I also noticed that the attachment was eaten when it entered 
> > > > > the bio-devel list.
> > > > >
> > > > > The file is also accessible in the extdata of the multicrispr:
> > > > > https://gitlab.gwdg.de/loosolab/software/multicrispr/blob/master/inst/extdata/SRF.bed
> > > > >
> > > > > A bedfile to GRanges importer requires columns 1 (chrom), 2 
> > > > > (chromStart), 3 (chromEnd), and column 6 (strand). All of these are 
> > > > > present in SRF.bed.
> > > > >
> > > > > I am curious as to why you feel that having additional columns in a 
> > > > > bedfile would break it?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Aditya
> > > > >
> > > > > ________________________________________
> > > > > From: Michael Lawrence [lawrence.mich...@gene.com]
> > > > > Sent: Tuesday, September 17, 2019 1:41 PM
> > > > > To: Bhagwat, Aditya
> > > > > Cc: Shepherd, Lori; bioc-devel@r-project.org
> > > > > Subject: Re: [Bioc-devel] read_bed()
> > > > >
> > > > > I don't see an attachment, nor can I find the multicrispr package
> > > > > anywhere. The "addressed soon" was referring to the BEDX+Y formats,
> > > > > which was addressed many years ago, so I've updated the documentation.
> > > > > Broken BED files will never be supported.
> > > > >
> > > > > Michael
> > > > >
> > > > > On Tue, Sep 17, 2019 at 4:17 AM Bhagwat, Aditya
> > > > > <aditya.bhag...@mpi-bn.mpg.de> wrote:
> > > > > >
> > > > > > Hi Lori,
> > > > > >
> > > > > > I remember now - I tried this function earlier, but it does not 
> > > > > > work for my bedfiles, like the one in attach.
> > > > > >
> > > > > > > bedfile <- system.file('extdata/SRF.bed', package = 'multicrispr')
> > > > > > >
> > > > > > > targetranges <- rtracklayer::import(bedfile, format = 'BED', 
> > > > > > > genome = 'mm10')
> > > > > > Error in scan(file = file, what = what, sep = sep, quote = quote, 
> > > > > > dec = dec, : scan() expected 'an integer', got 'chr2'
> > > > > > >
> > > > > >
> > > > > > Perhaps this sentence in `?rtracklayer::import` points to the 
> > > > > > source of the error?
> > > > > >
> > > > > > many tools and organizations have extended BED with additional 
> > > > > > columns. These are not officially valid BED files, and as such 
> > > > > > rtracklayer does not yet support them (this will be addressed soon).
> > > > > >
> > > > > > Which brings the question: how soon is soon :-D ?
> > > > > >
> > > > > > Aditya
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Shepherd, Lori [lori.sheph...@roswellpark.org]
> > > > > > Sent: Tuesday, September 17, 2019 1:02 PM
> > > > > > To: Bhagwat, Aditya; bioc-devel@r-project.org
> > > > > > Subject: Re: read_bed()
> > > > > >
> > > > > > Please look at rtracklayer::import() function that we recommend for 
> > > > > > reading of BAM files along with other common formats.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > >
> > > > > > Lori Shepherd
> > > > > >
> > > > > > Bioconductor Core Team
> > > > > >
> > > > > > Roswell Park Cancer Institute
> > > > > >
> > > > > > Department of Biostatistics & Bioinformatics
> > > > > >
> > > > > > Elm & Carlton Streets
> > > > > >
> > > > > > Buffalo, New York 14263
> > > > > >
> > > > > > ________________________________
> > > > > > From: Bioc-devel <bioc-devel-boun...@r-project.org> on behalf of 
> > > > > > Bhagwat, Aditya <aditya.bhag...@mpi-bn.mpg.de>
> > > > > > Sent: Tuesday, September 17, 2019 6:58 AM
> > > > > > To: bioc-devel@r-project.org <bioc-devel@r-project.org>
> > > > > > Subject: [Bioc-devel] read_bed()
> > > > > >
> > > > > > Dear bioc-devel,
> > > > > >
> > > > > > I had two feedback requests regarding the function read_bed().
> > > > > >
> > > > > > 1) Did I overlook, and therefore, re-invent existing functionality?
> > > > > > 2) If not, would `read_bed` be suited for existence in a more 
> > > > > > foundational package, e.g. `GenomicRanges`, given the rather basal 
> > > > > > nature of this functionality?
> > > > > >
> > > > > > It reads a bedfile into a GRanges, converts the coordinates from 
> > > > > > 0-based (bedfile) to 1-based 
> > > > > > (GRanges)<https://www.biostars.org/p/84686>, adds BSgenome info (to 
> > > > > > allow for implicit range validity 
> > > > > > checking<https://support.bioconductor.org/p/124250>) and plots the 
> > > > > > karyogram<https://support.bioconductor.org/p/124328>.
> > > > > >
> > > > > > Thank you for your feedback.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Aditya
> > > > > >
> > > > > >
> > > > > > #' Read bedfile into GRanges
> > > > > > #'
> > > > > > #' @param bedfile file path
> > > > > > #' @param bsgenome BSgenome, e.g. 
> > > > > > BSgenome.Mmusculus.UCSC.mm10::Mmusculus
> > > > > > #' @param zero_based logical(1): whether bedfile GRanges are 0-based
> > > > > > #' @param rm_duplicates logical(1)
> > > > > > #' @param plot logical(1)
> > > > > > #' @param verbose logical(1)
> > > > > > #' @return \code{\link[GenomicRanges]{GRanges-class}}
> > > > > > #' @note By convention BED files are 0-based. GRanges are always 
> > > > > > 1-based.
> > > > > > #' A good discussion on these two alternative codings is given
> > > > > > #' by Obi Griffith on Biostars: https://www.biostars.org/p/84686/
> > > > > > #' @examples
> > > > > > #' bedfile <- system.file('extdata/SRF.bed', package = 
> > > > > > 'multicrispr')
> > > > > > #' bsgenome <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus
> > > > > > #' (gr <- read_bed(bedfile, bsgenome))
> > > > > > #' @importFrom data.table :=
> > > > > > #' @export
> > > > > > read_bed <- function(
> > > > > > bedfile,
> > > > > > bsgenome,
> > > > > > zero_based = TRUE,
> > > > > > rm_duplicates = TRUE,
> > > > > > plot = TRUE,
> > > > > > verbose = TRUE
> > > > > > ){
> > > > > > # Assert
> > > > > > assert_all_are_existing_files(bedfile)
> > > > > > assert_is_a_bool(verbose)
> > > > > > assert_is_a_bool(rm_duplicates)
> > > > > > assert_is_a_bool(zero_based)
> > > > > >
> > > > > > # Comply
> > > > > > seqnames <- start <- end <- strand <- .N <- gap <- width <- NULL
> > > > > >
> > > > > > # Read
> > > > > > if (verbose) cmessage('\tRead %s', bedfile)
> > > > > > dt <- data.table::fread(bedfile, select = c(seq_len(3), 6),
> > > > > > col.names = c('seqnames', 'start', 'end', 'strand'))
> > > > > > data.table::setorderv(dt, c('seqnames', 'start', 'end', 'strand'))
> > > > > >
> > > > > > # Transform coordinates: 0-based -> 1-based
> > > > > > if (zero_based){
> > > > > > if (verbose) cmessage('\t\tConvert 0 -> 1-based')
> > > > > > dt[, start := start + 1]
> > > > > > }
> > > > > >
> > > > > > if (verbose) cmessage('\t\tRanges: %d ranges on %d chromosomes',
> > > > > > nrow(dt), length(unique(dt$seqnames)))
> > > > > >
> > > > > > # Drop duplicates
> > > > > > if (rm_duplicates){
> > > > > > is_duplicated <- cduplicated(dt)
> > > > > > if (any(is_duplicated)){
> > > > > > if (verbose) cmessage('\t\t %d after removing duplicates')
> > > > > > dt %<>% extract(!duplicated)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > # Turn into GRanges
> > > > > > gr <- add_seqinfo(as(dt, 'GRanges'), bsgenome)
> > > > > >
> > > > > > # Plot and return
> > > > > > title <- paste0(providerVersion(bsgenome), ': ', basename(bedfile))
> > > > > > if (plot) plot_karyogram(gr, title)
> > > > > > gr
> > > > > > }
> > > > > >
> > > > > >
> > > > > > [[alternative HTML version deleted]]
> > > > > >
> > > > > > _______________________________________________
> > > > > > Bioc-devel@r-project.org mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > > > >
> > > > > > This email message may contain legally privileged and/or 
> > > > > > confidential information. If you are not the intended recipient(s), 
> > > > > > or the employee or agent responsible for the delivery of this 
> > > > > > message to the intended recipient(s), you are hereby notified that 
> > > > > > any disclosure, copying, distribution, or use of this email message 
> > > > > > is prohibited. If you have received this message in error, please 
> > > > > > notify the sender immediately by e-mail and delete this email 
> > > > > > message from your computer. Thank you.
> > > > > > _______________________________________________
> > > > > > Bioc-devel@r-project.org mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Michael Lawrence
> > > > > Scientist, Bioinformatics and Computational Biology
> > > > > Genentech, A Member of the Roche Group
> > > > > Office +1 (650) 225-7760
> > > > > micha...@gene.com
> > > > >
> > > > > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> > > >
> > > >
> > > >
> > > > --
> > > > Michael Lawrence
> > > > Scientist, Bioinformatics and Computational Biology
> > > > Genentech, A Member of the Roche Group
> > > > Office +1 (650) 225-7760
> > > > micha...@gene.com
> > > >
> > > > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> > >
> > >
> > >
> > > --
> > > Michael Lawrence
> > > Scientist, Bioinformatics and Computational Biology
> > > Genentech, A Member of the Roche Group
> > > Office +1 (650) 225-7760
> > > micha...@gene.com
> > >
> > > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> >
> >
> >
> > --
> > Michael Lawrence
> > Scientist, Bioinformatics and Computational Biology
> > Genentech, A Member of the Roche Group
> > Office +1 (650) 225-7760
> > micha...@gene.com
> >
> > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> >
> > _______________________________________________
> > Bioc-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
> --
> Michael Lawrence
> Scientist, Bioinformatics and Computational Biology
> Genentech, A Member of the Roche Group
> Office +1 (650) 225-7760
> micha...@gene.com
>
> Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
micha...@gene.com

Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] read_bed()

Reply via email to