Re: [Bioc-devel] read_bed()

Bhagwat, Aditya Tue, 17 Sep 2019 05:50:20 -0700

Oh, superb, thx!

Interesting ... here you use S3 rather than S4 - I wonder the design intention 
underlying these choices (I'm asking because I am trying to figure out myself 
when to use S3 and when to use S4 and whether to mix the two).


Aditya

________________________________________
From: Michael Lawrence [lawrence.mich...@gene.com]
Sent: Tuesday, September 17, 2019 2:23 PM
To: Bhagwat, Aditya
Cc: Michael Lawrence; Shepherd, Lori; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] read_bed()

The generic documentation does not mention it, but see ?import.bed.
It's similar to colClasses on read.table().

On Tue, Sep 17, 2019 at 5:15 AM Bhagwat, Aditya
<aditya.bhag...@mpi-bn.mpg.de> wrote:
>
> Thankyou Michael,
>
> How do I use the extraCols argument? The documentation does not mention an 
> `extraCols` argument explicitly, so it must be one of the ellipsis arguments, 
> but `?rtracklayer::import` does not mention it. Should I say extraCols = 10 
> (ten extra columns) or so?
>
> Aditya
>
> ________________________________________
> From: Michael Lawrence [lawrence.mich...@gene.com]
> Sent: Tuesday, September 17, 2019 2:05 PM
> To: Bhagwat, Aditya
> Cc: Michael Lawrence; Shepherd, Lori; bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] read_bed()
>
> It breaks it because it's not standard BED; however, using the
> extraCols= argument should work in this case. Requiring an explicit
> format specification is intentional, because it provides validation
> and type safety, and it communicates the format to a future reader.
> This also looks a bit like a bedPE file, so you might consider using
> the Pairs data structure.
>
> Michael
>
> On Tue, Sep 17, 2019 at 4:51 AM Bhagwat, Aditya
> <aditya.bhag...@mpi-bn.mpg.de> wrote:
> >
> > Hi Michael,
> >
> > Yeah, I also noticed that the attachment was eaten when it entered the 
> > bio-devel list.
> >
> > The file is also accessible in the extdata of the multicrispr:
> > https://gitlab.gwdg.de/loosolab/software/multicrispr/blob/master/inst/extdata/SRF.bed
> >
> > A bedfile to GRanges importer requires columns 1 (chrom), 2 (chromStart), 3 
> > (chromEnd), and column 6 (strand). All of these are present in SRF.bed.
> >
> > I am curious as to why you feel that having additional columns in a bedfile 
> > would break it?
> >
> > Cheers,
> >
> > Aditya
> >
> > ________________________________________
> > From: Michael Lawrence [lawrence.mich...@gene.com]
> > Sent: Tuesday, September 17, 2019 1:41 PM
> > To: Bhagwat, Aditya
> > Cc: Shepherd, Lori; bioc-devel@r-project.org
> > Subject: Re: [Bioc-devel] read_bed()
> >
> > I don't see an attachment, nor can I find the multicrispr package
> > anywhere. The "addressed soon" was referring to the BEDX+Y formats,
> > which was addressed many years ago, so I've updated the documentation.
> > Broken BED files will never be supported.
> >
> > Michael
> >
> > On Tue, Sep 17, 2019 at 4:17 AM Bhagwat, Aditya
> > <aditya.bhag...@mpi-bn.mpg.de> wrote:
> > >
> > > Hi Lori,
> > >
> > > I remember now - I tried this function earlier, but it does not work for 
> > > my bedfiles, like the one in attach.
> > >
> > > > bedfile      <- system.file('extdata/SRF.bed', package = 'multicrispr')
> > > >
> > > > targetranges <- rtracklayer::import(bedfile, format = 'BED', genome = 
> > > > 'mm10')
> > > Error in scan(file = file, what = what, sep = sep, quote = quote, dec = 
> > > dec,  : scan() expected 'an integer', got 'chr2'
> > > >
> > >
> > > Perhaps this sentence in `?rtracklayer::import` points to the source of 
> > > the error?
> > >
> > > many tools and organizations have extended BED with additional columns. 
> > > These are not officially valid BED files, and as such rtracklayer does 
> > > not yet support them (this will be addressed soon).
> > >
> > > Which brings the question: how soon is soon :-D ?
> > >
> > > Aditya
> > >
> > >
> > > ________________________________
> > > From: Shepherd, Lori [lori.sheph...@roswellpark.org]
> > > Sent: Tuesday, September 17, 2019 1:02 PM
> > > To: Bhagwat, Aditya; bioc-devel@r-project.org
> > > Subject: Re: read_bed()
> > >
> > > Please look at rtracklayer::import()  function that we recommend for 
> > > reading of BAM files along with other common formats.
> > >
> > > Cheers,
> > >
> > >
> > > Lori Shepherd
> > >
> > > Bioconductor Core Team
> > >
> > > Roswell Park Cancer Institute
> > >
> > > Department of Biostatistics & Bioinformatics
> > >
> > > Elm & Carlton Streets
> > >
> > > Buffalo, New York 14263
> > >
> > > ________________________________
> > > From: Bioc-devel <bioc-devel-boun...@r-project.org> on behalf of Bhagwat, 
> > > Aditya <aditya.bhag...@mpi-bn.mpg.de>
> > > Sent: Tuesday, September 17, 2019 6:58 AM
> > > To: bioc-devel@r-project.org <bioc-devel@r-project.org>
> > > Subject: [Bioc-devel] read_bed()
> > >
> > > Dear bioc-devel,
> > >
> > > I had two feedback requests regarding the function read_bed().
> > >
> > > 1) Did I overlook, and therefore, re-invent existing functionality?
> > > 2) If not, would `read_bed` be suited for existence in a more 
> > > foundational package, e.g. `GenomicRanges`, given the rather basal nature 
> > > of this functionality?
> > >
> > > It reads a bedfile into a GRanges, converts the coordinates from 0-based 
> > > (bedfile) to 1-based (GRanges)<https://www.biostars.org/p/84686>, adds 
> > > BSgenome info (to allow for implicit range validity 
> > > checking<https://support.bioconductor.org/p/124250>) and plots the 
> > > karyogram<https://support.bioconductor.org/p/124328>.
> > >
> > > Thank you for your feedback.
> > >
> > > Cheers,
> > >
> > > Aditya
> > >
> > >
> > > #' Read bedfile into GRanges
> > > #'
> > > #' @param bedfile        file path
> > > #' @param bsgenome       BSgenome, e.g. 
> > > BSgenome.Mmusculus.UCSC.mm10::Mmusculus
> > > #' @param zero_based     logical(1): whether bedfile GRanges are 0-based
> > > #' @param rm_duplicates  logical(1)
> > > #' @param plot           logical(1)
> > > #' @param verbose        logical(1)
> > > #' @return \code{\link[GenomicRanges]{GRanges-class}}
> > > #' @note By convention BED files are 0-based. GRanges are always 1-based.
> > > #'       A good discussion on these two alternative codings is given
> > > #'       by Obi Griffith on Biostars: https://www.biostars.org/p/84686/
> > > #' @examples
> > > #' bedfile  <- system.file('extdata/SRF.bed', package = 'multicrispr')
> > > #' bsgenome <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus
> > > #' (gr <- read_bed(bedfile, bsgenome))
> > > #' @importFrom  data.table  :=
> > > #' @export
> > > read_bed <- function(
> > >     bedfile,
> > >     bsgenome,
> > >     zero_based    = TRUE,
> > >     rm_duplicates = TRUE,
> > >     plot          = TRUE,
> > >     verbose       = TRUE
> > > ){
> > >     # Assert
> > >     assert_all_are_existing_files(bedfile)
> > >     assert_is_a_bool(verbose)
> > >     assert_is_a_bool(rm_duplicates)
> > >     assert_is_a_bool(zero_based)
> > >
> > >     # Comply
> > >     seqnames <- start <- end <- strand <- .N <- gap <- width <- NULL
> > >
> > >     # Read
> > >     if (verbose) cmessage('\tRead %s', bedfile)
> > >     dt <- data.table::fread(bedfile, select = c(seq_len(3), 6),
> > >             col.names = c('seqnames', 'start', 'end', 'strand'))
> > >     data.table::setorderv(dt, c('seqnames', 'start', 'end', 'strand'))
> > >
> > >     # Transform coordinates: 0-based -> 1-based
> > >     if (zero_based){
> > >         if (verbose)    cmessage('\t\tConvert 0 -> 1-based')
> > >         dt[, start := start + 1]
> > >     }
> > >
> > >     if (verbose) cmessage('\t\tRanges: %d ranges on %d chromosomes',
> > >                             nrow(dt), length(unique(dt$seqnames)))
> > >
> > >     # Drop duplicates
> > >     if (rm_duplicates){
> > >         is_duplicated <- cduplicated(dt)
> > >         if (any(is_duplicated)){
> > >             if (verbose) cmessage('\t\t        %d after removing 
> > > duplicates')
> > >             dt %<>% extract(!duplicated)
> > >         }
> > >     }
> > >
> > >     # Turn into GRanges
> > >     gr <-  add_seqinfo(as(dt, 'GRanges'), bsgenome)
> > >
> > >     # Plot and return
> > >     title <- paste0(providerVersion(bsgenome), ': ', basename(bedfile))
> > >     if (plot) plot_karyogram(gr, title)
> > >     gr
> > > }
> > >
> > >
> > >         [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > Bioc-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >
> > > This email message may contain legally privileged and/or confidential 
> > > information. If you are not the intended recipient(s), or the employee or 
> > > agent responsible for the delivery of this message to the intended 
> > > recipient(s), you are hereby notified that any disclosure, copying, 
> > > distribution, or use of this email message is prohibited. If you have 
> > > received this message in error, please notify the sender immediately by 
> > > e-mail and delete this email message from your computer. Thank you.
> > > _______________________________________________
> > > Bioc-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
> >
> >
> > --
> > Michael Lawrence
> > Scientist, Bioinformatics and Computational Biology
> > Genentech, A Member of the Roche Group
> > Office +1 (650) 225-7760
> > micha...@gene.com
> >
> > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
>
>
>
> --
> Michael Lawrence
> Scientist, Bioinformatics and Computational Biology
> Genentech, A Member of the Roche Group
> Office +1 (650) 225-7760
> micha...@gene.com
>
> Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube



--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
micha...@gene.com

Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] read_bed()

Reply via email to