Re: [Bioc-sig-seq] BED file parser

Jonathan Cairns Thu, 10 Mar 2011 06:10:23 -0800

FWIW, if you need only the chr, start, end and strand information from a BED 
file, the function read.bed() in the BayesPeak package does this. Moreover, it 
is usually around twice as fast as import() because it ignores the rest of the 
columns. However, it cannot yet handle the compressed .bed.gz format.


Perhaps this would be a useful option in import.bed()? From my experience, use 
of BED files is not that uncommon for ChIP-seq reads.
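
As a rough illustration of the "ignore the rest of the columns" idea, here is a 
minimal base-R sketch. It is not the BayesPeak read.bed() code and not an 
rtracklayer option; the name read_bed_minimal() is made up for this example. It 
assumes a tab-separated BED file with at least six columns and no track or 
browser header lines.

```r
# Hypothetical helper, not part of BayesPeak or rtracklayer.
read_bed_minimal <- function(path) {
  # Count the tab-separated fields on the first line to size colClasses.
  first_line <- readLines(path, n = 1L)
  n_cols <- length(strsplit(first_line, "\t", fixed = TRUE)[[1]])
  stopifnot(n_cols >= 6L)  # strand is the sixth BED column

  # "NULL" makes read.table() skip a column instead of parsing it.
  col_classes <- rep("NULL", n_cols)
  col_classes[c(1, 2, 3, 6)] <- c("character", "integer", "integer", "character")

  bed <- read.table(path, sep = "\t", header = FALSE, quote = "",
                    comment.char = "", colClasses = col_classes)
  names(bed) <- c("chr", "start", "end", "strand")

  # BED starts are 0-based; shift to 1-based, matching the ranges in the
  # import() output quoted in this thread.
  bed$start <- bed$start + 1L
  bed
}
```

Something like tags <- read_bed_minimal("~/lit.bed") would then give a plain 
data frame with four columns. Whether this is actually faster than import() on a 
given file would need to be measured.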

Jonathan
________________________________________
From: [email protected] 
[[email protected]] On Behalf Of Ivan Gregoretti 
[[email protected]]
Sent: 09 March 2011 16:18
To: Michael Lawrence
Cc: [email protected]
Subject: Re: [Bioc-sig-seq] BED file parser

I use BED because it uses less memory.

BAM format contains the read names, the sequences, the quality string
and more information. I do not need that. I only need chromosome name,
start, end, and strand.

So, for almost all my analyses, I start by converting my .bam to a
minimalistic .bed.gz outside R and then from R I load my tags into a
GRanges with import().

As simple as that.

Ivan


Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878



On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
<[email protected]> wrote:
>
>
> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <[email protected]> wrote:
>>
>> I find simple BED files to be slow to import. I only use BED without
>> track headers. The data is derived mostly from *-seq so we are talking
>> about multiple million lines per file.
>>
>> The problem as I understand it is that the function reads one row at a
>> time. It could be much faster if it read, say, 1000 rows at a time.
>>
>
> I hope it's not reading one row at a time. It just calls read.table(), in a
> fairly efficient way, with colClasses specified, etc. Why do you have high
> throughput sequencing results in BED files? BED is really for genes. Most
> other things fit into BAM, bedGraph (which uses the same basic parser
> though), WIG, etc.
>
>>
>> I never get errors. There are no bugs to fix. It's just very slow for
>> the real world of high throughput sequencing. That's all.
>>
>> Thanks,
>>
>> Ivan
>>
>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
>> <[email protected]> wrote:
>> >
>> >
>> > On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <[email protected]>
>> > wrote:
>> >>
>> >> Just to expand a little bit on Vincent's response.
>> >>
>> >> If you happen to be handling very large BED files, you probably keep
>> >> them compressed. The good news is that even in that case, you can load
>> >> them:
>> >>
>> >> lit = import("~/lit.bed.gz", "bed")
>> >>
>> >> There is still the long-standing issue of how slow the import()
>> >> function is but I am still hopeful.
>> >>
>> >
>> > This is the first I've heard of this. What sort of files are slow? Do they
>> > have a track line? The parsing gets complicated when there are track lines
>> > and multiple tracks in a file. BED is a complex format with many variants.
>> >
>> >>
>> >> Ivan
>> >>
>> >> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>> >> <[email protected]> wrote:
>> >> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <[email protected]>:
>> >> >> Hi,
>> >> >>
>> >> >> Is there a BED file parser for R?
>> >> >
>> >> > I suppose it depends on what you mean by "parser".  import() from the
>> >> > rtracklayer package imports BED and constructs and populates a
>> >> > RangedData object with the contents.  Here we look at a small BED file
>> >> > in text, start R, load rtracklayer, import the data, show the result,
>> >> > and show the resources used.
>> >> >
>> >> > bash-3.2$ head ~/junc716_20.bed
>> >> > chr20   55658   64827   JUNC00000001    14      +       55658   64827   255,0,0 2       27,25   0,9144
>> >> > chr20   55662   64821   JUNC00000002    2       -       55662   64821   255,0,0 2       34,8    0,9151
>> >> > chr20   135774  147029  JUNC00000003    1       -       135774  147029  255,0,0 2       8,29    0,11226
>> >> > chr20   167951  172361  JUNC00000004    1       +       167951  172361  255,0,0 2       29,8    0,4402
>> >> > chr20   189824  192113  JUNC00000005    3       +       189824  192113  255,0,0 2       33,9    0,2280
>> >> > chr20   189829  192113  JUNC00000006    3       +       189829  192113  255,0,0 2       32,9    0,2275
>> >> > chr20   193930  199576  JUNC00000007    4       -       193930  199576  255,0,0 2       28,11   0,5635
>> >> > chr20   207050  207846  JUNC00000008    2       -       207050  207846  255,0,0 2       20,34   0,762
>> >> > chr20   218306  218925  JUNC00000009    1       -       218306  218925  255,0,0 2       11,26   0,593
>> >> > chr20   221160  225070  JUNC00000010    25      -       221160  225070  255,0,0 2       29,9    0,3901
>> >> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>> >> > bash-3.2$ R213 --vanilla --quiet
>> >> >> library(rtracklayer)
>> >> > Loading required package: RCurl
>> >> > Loading required package: bitops
>> >> >> lit = import("~/lit.bed")
>> >> >> lit
>> >> > RangedData with 10 rows and 9 value columns across 1 space
>> >> >         space           ranges |         name     score      strand thickStart
>> >> >   <character>        <IRanges> |  <character> <numeric> <character>  <integer>
>> >> > 1        chr20 [ 55659,  64827] | JUNC00000001        14           +      55658
>> >> > 2        chr20 [ 55663,  64821] | JUNC00000002         2           -      55662
>> >> > 3        chr20 [135775, 147029] | JUNC00000003         1           -     135774
>> >> > 4        chr20 [167952, 172361] | JUNC00000004         1           +     167951
>> >> > 5        chr20 [189825, 192113] | JUNC00000005         3           +     189824
>> >> > 6        chr20 [189830, 192113] | JUNC00000006         3           +     189829
>> >> > 7        chr20 [193931, 199576] | JUNC00000007         4           -     193930
>> >> > 8        chr20 [207051, 207846] | JUNC00000008         2           -     207050
>> >> > 9        chr20 [218307, 218925] | JUNC00000009         1           -     218306
>> >> > 10       chr20 [221161, 225070] | JUNC00000010        25           -     221160
>> >> >    thickEnd     itemRgb blockCount  blockSizes blockStarts
>> >> >   <integer> <character>  <integer> <character> <character>
>> >> > 1      64827     #FF0000          2       27,25      0,9144
>> >> > 2      64821     #FF0000          2        34,8      0,9151
>> >> > 3     147029     #FF0000          2        8,29     0,11226
>> >> > 4     172361     #FF0000          2        29,8      0,4402
>> >> > 5     192113     #FF0000          2        33,9      0,2280
>> >> > 6     192113     #FF0000          2        32,9      0,2275
>> >> > 7     199576     #FF0000          2       28,11      0,5635
>> >> > 8     207846     #FF0000          2       20,34       0,762
>> >> > 9     218925     #FF0000          2       11,26       0,593
>> >> > 10    225070     #FF0000          2        29,9      0,3901
>> >> >
>> >> >> sessionInfo()
>> >> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>> >> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>> >> >
>> >> > locale:
>> >> > [1] C
>> >> >
>> >> > attached base packages:
>> >> > [1] stats     graphics  grDevices utils     datasets  methods   base
>> >> >
>> >> > other attached packages:
>> >> > [1] rtracklayer_1.11.11 RCurl_1.5-0         bitops_1.0-4.1
>> >> >
>> >> > loaded via a namespace (and not attached):
>> >> > [1] BSgenome_1.19.4      Biobase_2.11.9       Biostrings_2.19.15
>> >> > [4] GenomicRanges_1.3.23 IRanges_1.9.25       Matrix_0.999375-47
>> >> > [7] XML_3.2-0            grid_2.13.0          lattice_0.19-17
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >>     /    Thiago Yukio Kikuchi Oliveira
>> >> >> (=\
>> >> >>   \=) Faculdade de Medicina de Ribeirão Preto
>> >> >>    /   Laboratório de Genética Molecular e Bioinformática
>> >> >>   /=)
>> >> >> -----------------------------------------------------------------
>> >> >> (=/   Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib.
>> >> >> Preto
>> >> >>   /    Rua Tenente Catão Roxo, 2501 CEP 14151-140
>> >> >> (=\   Ribeirão Preto - São Paulo
>> >> >>   \=) Fone: 55 16 2101-9300   Ramal: 9603
>> >> >>    /   E-mail: [email protected]
>> >> >>   /=)            [email protected]
>> >> >> (=/
>> >> >>   /    Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>> >> >> (=\   Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>> >> >>   \=)
>> >> >>    /
>> >> >> -----------------------------------------------------------------
>> >> >>
>> >> >> _______________________________________________
>> >> >> Bioc-sig-sequencing mailing list
>> >> >> [email protected]
>> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >>

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
