Hello Martin, Jonathan and everybody,

I was unaware of the existence of scanBam() and read.bed().
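A minimal sketch of the scanBam() route, for anyone finding this thread later: it asks Rsamtools for only the reference name, strand, position and query width of each mapped read and builds a GRanges from them. The helper name bam_to_granges and the choice of fields are assumptions made for illustration, not something posted in this thread. Using qwidth as the range width ignores indels and spliced alignments, so the end coordinates should be treated as approximate.

```r
library(Rsamtools)      # provides scanBam(), ScanBamParam(), scanBamFlag()
library(GenomicRanges)  # provides GRanges(); IRanges is attached as a dependency

# Hypothetical helper: load chromosome, start, width and strand from a BAM
# file, skipping read names, sequences and quality strings.
bam_to_granges <- function(bam_path) {
  param <- ScanBamParam(
    flag = scanBamFlag(isUnmappedQuery = FALSE),  # drop unmapped reads
    what = c("rname", "strand", "pos", "qwidth")  # request only these fields
  )
  # scanBam() returns a list with one element per requested range;
  # with no ranges given there is a single element.
  aln <- scanBam(bam_path, param = param)[[1]]
  GRanges(
    seqnames = aln$rname,
    ranges   = IRanges(start = aln$pos, width = aln$qwidth),
    strand   = aln$strand
  )
}

# Usage, with the example BAM file that ships with Rsamtools:
# tags <- bam_to_granges(system.file("extdata", "ex1.bam", package = "Rsamtools"))
```

Whether this is lighter on memory than the .bed.gz round trip depends on the data, so it is worth measuring on a real file before switching.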
I'm glad I brought up the issue. Thank you.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878

On Thu, Mar 10, 2011 at 9:08 AM, Jonathan Cairns <[email protected]> wrote:
> FWIW, if you need only the chr, start, end and strand information from a BED
> file, the function read.bed() in the BayesPeak package does this. Moreover, it
> is usually around twice as fast as import() because it ignores the rest of
> the columns. However, it cannot yet handle the compressed .bed.gz format.
>
> Perhaps this would be a useful option in import.bed()? From my experience,
> use of BED files is not that uncommon for ChIP-seq reads.
>
> Jonathan
> ________________________________________
> From: [email protected]
> [[email protected]] On Behalf Of Ivan Gregoretti
> [[email protected]]
> Sent: 09 March 2011 16:18
> To: Michael Lawrence
> Cc: [email protected]
> Subject: Re: [Bioc-sig-seq] BED file parser
>
> I use BED because it uses less memory.
>
> BAM format contains the read names, the sequences, the quality string
> and more information. I do not need that. I only need chromosome name,
> start, end, and strand.
>
> So, for almost all my analyses, I start by converting my .bam to a
> minimalistic .bed.gz outside R and then from R I load my tags into a
> GRanges with import().
>
> As simple as that.
>
> Ivan
>
> On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
> <[email protected]> wrote:
>>
>> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <[email protected]> wrote:
>>>
>>> I find simple BED files to be slow to import.
>>> I only use BED without track headers. The data is derived mostly from
>>> *-seq, so we are talking about multiple million lines per file.
>>>
>>> The problem as I understand it is that the function reads one row at a
>>> time. It could be much faster if it read, say, 1000 rows at a time.
>>>
>>
>> I hope it's not reading one row at a time. It just calls read.table(), in a
>> fairly efficient way, with colClasses specified, etc. Why do you have high
>> throughput sequencing results in BED files? BED is really for genes. Most
>> other things fit into BAM, bedGraph (which uses the same basic parser
>> though), WIG, etc.
>>
>>>
>>> I never get errors. There are no bugs to fix. It's just very slow for
>>> the real world of high throughput sequencing. That's all.
>>>
>>> Thanks,
>>>
>>> Ivan
>>>
>>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
>>> <[email protected]> wrote:
>>> >
>>> > On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <[email protected]>
>>> > wrote:
>>> >>
>>> >> Just to expand a little bit on Vincent's response.
>>> >>
>>> >> If you happen to be handling very large BED files, you probably keep
>>> >> them compressed. The good news is that even in that case, you can load
>>> >> them:
>>> >>
>>> >> lit = import("~/lit.bed.gz", "bed")
>>> >>
>>> >> There is still the long-standing issue of how slow the import()
>>> >> function is but I am still hopeful.
>>> >>
>>> >
>>> > This is the first I've heard of this. What sort of files are slow? Do
>>> > they have a track line? The parsing gets complicated when there are
>>> > track lines and multiple tracks in a file. BED is a complex format with
>>> > many variants.
>>> >
>>> >>
>>> >> Ivan
>>> >>
>>> >> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>>> >> <[email protected]> wrote:
>>> >> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <[email protected]>:
>>> >> >> Hi,
>>> >> >>
>>> >> >> Is there a BED file parser for R?
>>> >> >
>>> >> > I suppose it depends on what you mean by "parser". import() from the
>>> >> > rtracklayer package imports BED and constructs and populates a
>>> >> > RangedData object with the contents. Here we look at a small BED file
>>> >> > in text, start R, load rtracklayer, import the data, show the result,
>>> >> > and show the resources used.
>>> >> >
>>> >> > bash-3.2$ head ~/junc716_20.bed
>>> >> > chr20 55658 64827 JUNC00000001 14 + 55658 64827 255,0,0 2 27,25 0,9144
>>> >> > chr20 55662 64821 JUNC00000002 2 - 55662 64821 255,0,0 2 34,8 0,9151
>>> >> > chr20 135774 147029 JUNC00000003 1 - 135774 147029 255,0,0 2 8,29 0,11226
>>> >> > chr20 167951 172361 JUNC00000004 1 + 167951 172361 255,0,0 2 29,8 0,4402
>>> >> > chr20 189824 192113 JUNC00000005 3 + 189824 192113 255,0,0 2 33,9 0,2280
>>> >> > chr20 189829 192113 JUNC00000006 3 + 189829 192113 255,0,0 2 32,9 0,2275
>>> >> > chr20 193930 199576 JUNC00000007 4 - 193930 199576 255,0,0 2 28,11 0,5635
>>> >> > chr20 207050 207846 JUNC00000008 2 - 207050 207846 255,0,0 2 20,34 0,762
>>> >> > chr20 218306 218925 JUNC00000009 1 - 218306 218925 255,0,0 2 11,26 0,593
>>> >> > chr20 221160 225070 JUNC00000010 25 - 221160 225070 255,0,0 2 29,9 0,3901
>>> >> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>>> >> > bash-3.2$ R213 --vanilla --quiet
>>> >> > > library(rtracklayer)
>>> >> > Loading required package: RCurl
>>> >> > Loading required package: bitops
>>> >> > > lit = import("~/lit.bed")
>>> >> > > lit
>>> >> > RangedData with 10 rows and 9 value columns across 1 space
>>> >> >          space           ranges |         name     score      strand thickStart
>>> >> >    <character>        <IRanges> |  <character> <numeric> <character>  <integer>
>>> >> > 1        chr20 [ 55659,  64827] | JUNC00000001        14           +      55658
>>> >> > 2        chr20 [ 55663,  64821] | JUNC00000002         2           -      55662
>>> >> > 3        chr20 [135775, 147029] | JUNC00000003         1           -     135774
>>> >> > 4        chr20 [167952, 172361] | JUNC00000004         1           +     167951
>>> >> > 5        chr20 [189825, 192113] | JUNC00000005         3           +     189824
>>> >> > 6        chr20 [189830, 192113] | JUNC00000006         3           +     189829
>>> >> > 7        chr20 [193931, 199576] | JUNC00000007         4           -     193930
>>> >> > 8        chr20 [207051, 207846] | JUNC00000008         2           -     207050
>>> >> > 9        chr20 [218307, 218925] | JUNC00000009         1           -     218306
>>> >> > 10       chr20 [221161, 225070] | JUNC00000010        25           -     221160
>>> >> >     thickEnd     itemRgb blockCount  blockSizes blockStarts
>>> >> >    <integer> <character>  <integer> <character> <character>
>>> >> > 1      64827     #FF0000          2       27,25      0,9144
>>> >> > 2      64821     #FF0000          2        34,8      0,9151
>>> >> > 3     147029     #FF0000          2        8,29     0,11226
>>> >> > 4     172361     #FF0000          2        29,8      0,4402
>>> >> > 5     192113     #FF0000          2        33,9      0,2280
>>> >> > 6     192113     #FF0000          2        32,9      0,2275
>>> >> > 7     199576     #FF0000          2       28,11      0,5635
>>> >> > 8     207846     #FF0000          2       20,34       0,762
>>> >> > 9     218925     #FF0000          2       11,26       0,593
>>> >> > 10    225070     #FF0000          2        29,9      0,3901
>>> >> >
>>> >> > > sessionInfo()
>>> >> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>>> >> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>> >> >
>>> >> > locale:
>>> >> > [1] C
>>> >> >
>>> >> > attached base packages:
>>> >> > [1] stats     graphics  grDevices utils     datasets  methods   base
>>> >> >
>>> >> > other attached packages:
>>> >> > [1] rtracklayer_1.11.11 RCurl_1.5-0         bitops_1.0-4.1
>>> >> >
>>> >> > loaded via a namespace (and not attached):
>>> >> > [1] BSgenome_1.19.4      Biobase_2.11.9       Biostrings_2.19.15
>>> >> > [4] GenomicRanges_1.3.23 IRanges_1.9.25       Matrix_0.999375-47
>>> >> > [7] XML_3.2-0            grid_2.13.0          lattice_0.19-17
>>> >> >
>>> >> >>
>>> >> >> Thanks
>>> >> >>
>>> >> >> / Thiago Yukio Kikuchi Oliveira
>>> >> >> (=\
>>> >> >> \=) Faculdade de Medicina de Ribeirão Preto
>>> >> >> / Laboratório de Genética Molecular e Bioinformática
>>> >> >> /=)
>>> >> >> -----------------------------------------------------------------
>>> >> >> (=/ Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib. Preto
>>> >> >> / Rua Tenente Catão Roxo, 2501 CEP 14151-140
>>> >> >> (=\ Ribeirão Preto - São Paulo
>>> >> >> \=) Phone: 55 16 2101-9300 Ext.: 9603
>>> >> >> / E-mail: [email protected]
>>> >> >> /=) [email protected]
>>> >> >> (=/
>>> >> >> / Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>>> >> >> (=\ Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>>> >> >> \=)
>>> >> >> /
>>> >> >> -----------------------------------------------------------------
>>> >> >>
>>> >> >> _______________________________________________
>>> >> >> Bioc-sig-sequencing mailing list
>>> >> >> [email protected]
>>> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
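Two points made in the thread can be sketched in base R: import() goes through read.table() with colClasses, and read.bed() gains speed by ignoring the columns it does not need. The function below is a hypothetical illustration, not the code of rtracklayer or BayesPeak. It assumes a headerless, tab-separated BED file with at least six columns and no track or browser lines, and it relies on gzfile() reading both plain and gzip-compressed files.

```r
# Hypothetical helper (not part of rtracklayer or BayesPeak): read only
# chrom, start, end and strand from a headerless, tab-separated BED file.
read_bed_minimal <- function(path) {
  # Peek at the first line to count the columns. gzfile() reads plain and
  # gzip-compressed files alike, so .bed and .bed.gz both work.
  con <- gzfile(path, "rt")
  first_line <- readLines(con, n = 1L)
  close(con)
  if (length(first_line) == 0L) stop("empty file: ", path)
  n_cols <- length(strsplit(first_line, "\t", fixed = TRUE)[[1]])
  if (n_cols < 6L) stop("need at least 6 BED columns to get the strand")

  # "NULL" in colClasses makes read.table() skip that column entirely.
  col_classes <- rep("NULL", n_cols)
  col_classes[c(1, 6)] <- "character"  # chrom, strand
  col_classes[c(2, 3)] <- "integer"    # chromStart, chromEnd

  # read.table() opens the unopened connection and closes it when done.
  bed <- read.table(gzfile(path), sep = "\t", quote = "",
                    comment.char = "", colClasses = col_classes,
                    stringsAsFactors = FALSE)
  names(bed) <- c("chrom", "start", "end", "strand")

  # BED starts are 0-based; shift to 1-based, as in the import() output
  # quoted in this thread (55658 in the file becomes 55659).
  bed$start <- bed$start + 1L
  bed
}

# Usage (the path is only an example):
# tags <- read_bed_minimal("~/lit.bed.gz")
```

The result is a plain data frame. Turning it into a GRanges or RangedData object is a separate step, and the speed gain over import() would need to be measured on a real multi-million-line file.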
