Hi Christoph,

We are still discussing this matter internally and a few additional 
comments/questions have come up.  

1. We were surprised to hear of a 50x increase in size from wiggle to bigWig.
Can you please describe the structure of the input that causes that kind of
output bloat?  (How many bases per data point, how far apart are the data
points, etc.?)  If you would like, you can send me an excerpt of the file
(not to the list, which strips attachments).

2. gzip of a bigBed file reduced its size only by a factor of 3, and if we 
adopt a block compression scheme like BAM's BGZF, we would get somewhat less.  
bigBed and bigWig actually contain a lot more information than the original bed 
or bedGraph/wiggle files.  They were designed with speed in mind, specifically 
the speed of genome browser display of large regions, so they include several 
layers of summary data that add up to the same size as the original dataset.  
Finally, bigWig has much more numerical precision than the binary wiggle 
format.  

3. We use a compressed file system (ZFS) on some of our servers -- your mileage 
may vary, but that might help.  
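
To illustrate the trade-off in point 2, here is a minimal Python sketch. It
is not BGZF and not the bigBed/bigWig code; it only uses zlib to compare
compressing a whole file as one stream against compressing independent
fixed-size blocks that can each be inflated on their own. The block size, the
function names, and the synthetic bedGraph-like data are all assumptions made
for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen only for illustration


def compress_whole(data: bytes) -> bytes:
    """Compress the entire input as a single zlib stream."""
    return zlib.compress(data)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress fixed-size slices independently so each can be read alone."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def read_range(blocks: list[bytes], start: int, end: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return uncompressed bytes [start, end), inflating only overlapping blocks."""
    if start >= end:
        return b""
    first = start // block_size
    last = (end - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    offset = first * block_size
    return chunk[start - offset:end - offset]


def sample_bedgraph(rows: int = 50_000) -> bytes:
    """Build synthetic bedGraph-like text; the values are made up for the demo."""
    lines = []
    for i in range(rows):
        start = i * 25
        value = (i * 7919) % 1000 / 10
        lines.append(f"chr1\t{start}\t{start + 25}\t{value:.1f}\n")
    return "".join(lines).encode("ascii")


if __name__ == "__main__":
    data = sample_bedgraph()
    whole = len(compress_whole(data))
    blocked = sum(len(b) for b in compress_blocks(data))
    print(f"raw={len(data)} whole={whole} blocked={blocked}")
```

Each block starts with an empty compression history and carries its own
header, so the blocked total usually comes out somewhat larger than the
single stream. In exchange, read_range() can serve a region without
inflating the whole file.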

If you are interested in beta testing the BAM display code, please let me know. 
 

Angie


----- "Jennifer Jackson" <[email protected]> wrote:

> From: "Jennifer Jackson" <[email protected]>
> To: "Christoph Bock" <[email protected]>
> Cc: [email protected]
> Sent: Tuesday, September 29, 2009 8:39:59 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [Genome] bigBed/bigWig files excessively large?
>
> Hello Christoph,
> 
> Thanks for your suggestions.  Adding compression to the big* formats 
> is on our radar but the work has not yet been scheduled.  Work is  
> underway on browser support for BAM files as custom tracks, and we  
> hope to release that early next year.
> 
> 
> Jennifer
> 
> 
> ------------------------------------------------ 
> Jennifer Jackson 
> UCSC Genome Bioinformatics Group 
> 
> ----- "Christoph Bock" <[email protected]> wrote:
> 
> > From: "Christoph Bock" <[email protected]>
> > To: [email protected]
> > Sent: Saturday, September 26, 2009 6:26:49 PM GMT -08:00 US/Canada Pacific
> > Subject: [Genome] bigBed/bigWig files excessively large?
> >
> > Dear Developers,
> > 
> > I noticed that *.bigBed files are often 10-fold or even 20-fold larger than
> > their more classical *.bed.gz counterparts. Worse still, for *.bigWig files
> > a 50-fold bloat is frequently observed.
> > Such file sizes become a significant problem when hosting thousands of
> > high-throughput sequencing datasets for user inspection. It would be great
> > if you could provide a solution that enables both (i) selective upload, and
> > (ii) reasonable compression.
> > 
> > The BAM data format [http://samtools.sourceforge.net/SAM1.pdf] demonstrates
> > that it is quite possible to combine index-based access with gzip
> > compression. I'm sure that a similar solution is feasible for bigBed/bigWig
> > files as well. Alternatively or in addition, it would be very convenient if
> > the UCSC Genome Browser could soon support the SAM/BAM file formats (which
> > seem to emerge as the standard data format for large-scale sequencing data
> > anyway).
> > 
> > Thanks,
> >  Christoph
> > 
> > ___________________________________________________________________
> > Dr. Christoph Bock [[email protected]]
> > Department of Stem Cell and Regenerative Biology
> > Harvard University
> > 
> > _______________________________________________
> > Genome maillist  -  [email protected]
> > https://lists.soe.ucsc.edu/mailman/listinfo/genome