On Fri, Sep 18, 2009 at 6:24 AM, Ivan Gregoretti <[email protected]> wrote:
> Hi Patrick and everybody,
>
> > To everyone,
> > What other data reduction operations would you like to have on bed file
> > import?
> >
> > Patrick
>
> BED functionality must-haves:
>
> well, a very common task is to load all chromosome BED records but
> segregating by strand. In ChIP-seq analysis for example, an
> accumulation of forward reads on the left and reverse reads on the
> right is a good indicator of true peak presence.
>
> So, we need to be given the choice of loading "+", "-", or
> unspecified. The BED specification
> http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED
> says that a record without field number 6 (strand) is perfectly valid.
>

This would be a useful filter. I hope it's clear though that these types
of manipulations are pretty easy to do after loading the data, as well.

> Now, regarding the WIG block counting, the user should be able to
> specify the shiftSize. What's shiftSize? Well, each read is only the
> end of a DNA fragment that is typically 120 to 200 bases. So, the
> inferred position of the fragment should be its start position plus
> 60 to 100 bases. If the fragment matches the reverse strand, then the
> inferred centre of the fragment should be its 'end' minus 60 to 100.
> That is the shiftSize.
>
> When no strand is specified, the centre of the tag should be an
> acceptable choice.
>
> BED functionality to brag about:
>
> It would be extremely useful to be able to selectively load BED
> records contained in a set of genomic regions. (Something like the
> %in% functionality that Martin recently added to the ShortRead
> package.)
> So, let's imagine a tags-containing file and a big regions-containing
> file. Then we'd do
>
> myBigRegions <- import('myBigRegions.bed')
> insideRegions <- import('myTags.bed', in=myBigRegions, strand=c("+"))
> or also perhaps
> outsideRegions <- import('myTags.bed', not_in=myBigRegions, strand=c("+"))
>

All of this would be pretty easy to do within the proposed framework.
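
To illustrate the point about doing these manipulations after loading
the data, here is a minimal base-R sketch. It does not use rtracklayer:
the toy data frame, the helper names filter_strand, fragment_centre and
in_regions, and the default shift of 80 bases (picked from the 60 to 100
range quoted above) are all assumptions made for the example.

```r
# Toy BED records (0-based start, half-open end); values are made up.
bed <- data.frame(
  chrom  = c("chr1", "chr1", "chr1", "chr2"),
  start  = c(100L, 400L, 900L, 50L),
  end    = c(136L, 436L, 936L, 86L),
  name   = c("r1", "r2", "r3", "r4"),
  score  = c(0L, 0L, 0L, 0L),
  strand = c("+", "-", ".", "+"),
  stringsAsFactors = FALSE
)

# Keep only the records on the requested strand(s).
filter_strand <- function(bed, strand = c("+", "-", ".")) {
  bed[bed$strand %in% strand, , drop = FALSE]
}

# Inferred fragment centre: start + shift on "+", end - shift on "-",
# and the midpoint of the tag itself when no strand is given.
# shift_size = 80 is an assumed default, not a recommended value.
fragment_centre <- function(bed, shift_size = 80L) {
  mid <- (bed$start + bed$end) %/% 2L
  ifelse(bed$strand == "+", bed$start + shift_size,
         ifelse(bed$strand == "-", bed$end - shift_size, mid))
}

# TRUE for each tag whose [start, end) interval overlaps at least one
# region on the same chromosome (simple quadratic scan, fine for a demo).
in_regions <- function(tags, regions) {
  vapply(seq_len(nrow(tags)), function(i) {
    any(regions$chrom == tags$chrom[i] &
        regions$start < tags$end[i] &
        regions$end > tags$start[i])
  }, logical(1))
}

regions <- data.frame(chrom = "chr1", start = 0L, end = 500L,
                      stringsAsFactors = FALSE)

plus    <- filter_strand(bed, "+")
centres <- fragment_centre(bed)
hit     <- in_regions(bed, regions)
inside  <- bed[hit & bed$strand == "+", , drop = FALSE]
outside <- bed[!hit & bed$strand == "+", , drop = FALSE]
```

One practical note on the proposed interface: `in` is a reserved word in
R, so an argument with that name would have to be backquoted or renamed
in a real function signature.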
Some high-level functionality, like using estimated fragment length in
the coverage calculation, might belong in e.g. the chipseq package.

Also, while BED is a common format, it's not the only one; really one
wants block processing for every track format (WIG, GFF, etc.), and this
could even be generalized beyond tracks to loading data of any type. I
know we've tossed around the idea of some sort of common I/O package.
The low-level callback mechanism could find a home there. There could be
incremental readers based directly on read.table and scan. Then
rtracklayer could provide a handler that translates the table into a
RangedData, and then delegates to the user and filter handlers.

Michael

> Thank you,
>
> Ivan
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
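
As a rough sketch of the incremental reader idea in the reply above (a
reader built on read.table that hands each block to a filter handler and
then a user handler), something along these lines might work. The
function name read_blocks, its arguments, the fixed six-column BED
layout and the tiny block size are all assumptions for illustration, and
the handler receives a plain data.frame rather than a RangedData.

```r
# Read a six-column, tab-separated BED file `block_size` lines at a time.
# Each block is parsed with read.table, passed through `filter`, and the
# result is handed to `handler`. Hypothetical helper, not an existing API.
read_blocks <- function(path, handler, filter = function(df) df,
                        block_size = 2L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  cols <- c("chrom", "start", "end", "name", "score", "strand")
  classes <- c("character", "integer", "integer",
               "character", "integer", "character")
  repeat {
    lines <- readLines(con, n = block_size)
    if (length(lines) == 0L) break          # end of file
    block <- read.table(text = lines, sep = "\t", quote = "",
                        comment.char = "", col.names = cols,
                        colClasses = classes)
    handler(filter(block))
  }
  invisible(NULL)
}

# Small made-up input file for the demonstration.
path <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t136\tr1\t0\t+",
             "chr1\t400\t436\tr2\t0\t-",
             "chr1\t900\t936\tr3\t0\t+",
             "chr2\t50\t86\tr4\t0\t+",
             "chr2\t70\t106\tr5\t0\t-"), path)

# User handler: accumulate per-chromosome record counts across blocks.
counts <- integer(0)
count_handler <- function(block) {
  tab <- table(block$chrom)
  for (chrom in names(tab)) {
    prev <- if (chrom %in% names(counts)) counts[[chrom]] else 0L
    counts[[chrom]] <<- prev + as.integer(tab[[chrom]])
  }
}

# Filter handler: keep forward-strand records only.
read_blocks(path, count_handler,
            filter = function(df) df[df$strand == "+", , drop = FALSE])
```

Only one block is held in memory at a time, so the same pattern would
apply to files too large to load whole; a format-specific layer would
only need to supply the parsing step.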
