Re: [Bioc-sig-seq] chipseq infrastructure

Michael Lawrence Mon, 01 Mar 2010 22:58:00 -0800

On Mon, Mar 1, 2010 at 8:57 PM, Deepayan Sarkar
<[email protected]> wrote:


> On Mon, Mar 1, 2010 at 7:08 AM, Michael Lawrence
> <[email protected]> wrote:
> > Hey guys,
> >
> > I'm wondering if anyone has given any thought to some sort of generic
> > framework for chipseq analysis in Bioconductor, based on the IRanges,
> > Biostrings, etc infrastructure. chipseq has some nice utilities; could it
> > be transformed into some sort of generic chipseq pipeline? Something like
> > how the 'affy' package (I think?) allows other packages to provide
> > alternative implementations for particular stages. Just having a clean,
> > refined, approximately complete set of chipseq-focused utilities would be
> > nice. Presumably chipseq could fill that role? I think we now have a good
> > idea of the basic steps in chipseq analysis, so it's probably time for
> > such a package to emerge.
> >
> > Comments?
>
> Good idea of course, but will need thought. We should probably start
> with identifying typical stages of the analysis, and formulating
> suitable data structures. What we have now is:
>
>  - Data I/O and QA: External software + ShortRead
>
>  - Data reduction: Is "GenomeDataList" good, or do we want something
> else as an intermediate on-disk storage format?
>
>  - Modeling + Peak Calling: Is coverage the right abstraction? We have
> one method based on coverage, but not all methods are.
>
>   I'm also not sure how much of this can be put into a framework. For
> example, it's not clear how genomic annotation can be incorporated.
> One can call peaks and then "intersect" with promoter regions, or
> bypass peak-calling and start directly with promoter regions.
>
>
Granted, we need to be flexible, but I think we can definitely come up with
a set of data structures that are very commonly used with chip-seq data.
Right now chipseq relies heavily on GenomeData(List), which I see as a
fallback for when a more appropriate structure does not exist. Should
GenomeDataList be considered the equivalent of the ExpressionSet for
microarray analysis? Or could we come up with a more constrained container?

I think we want a convenient way to construct the data structure via
ShortRead I/O, but skipping the sequence data. Coverage and then peaks or
"regions of interest" like promoters could then be found and added to the
container.  We wouldn't need to keep the read positions in memory; they
could be dropped after calculating coverage, but loaded on demand if needed
later.
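
As a rough illustration of that container idea, here is a minimal sketch. It
is written in Python rather than R purely to keep it self-contained, and
every name in it (ChipSeqSample, load_reads, regions_above, and so on) is
hypothetical; none of this is existing Bioconductor or ShortRead API. It
assumes a single chromosome, reads given as 1-based inclusive (start, end)
intervals, and a simple depth threshold standing in for a real peak caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def coverage(reads: List[Interval], length: int) -> List[int]:
    """Per-base read depth over positions 1..length, via a difference array."""
    diff = [0] * (length + 2)
    for start, end in reads:
        diff[start] += 1
        diff[end + 1] -= 1
    depth, running = [], 0
    for pos in range(1, length + 1):
        running += diff[pos]
        depth.append(running)
    return depth


def regions_above(depth: List[int], lower: int) -> List[Interval]:
    """Maximal runs of positions whose depth is at least `lower`."""
    regions, start = [], None
    for pos, d in enumerate(depth, start=1):
        if d >= lower and start is None:
            start = pos
        elif d < lower and start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


@dataclass
class ChipSeqSample:
    """Hypothetical container: coverage and regions kept, reads reloadable."""
    load_reads: Callable[[], List[Interval]]  # stands in for alignment I/O
    length: int                               # chromosome length
    depth: Optional[List[int]] = None
    regions: Dict[str, List[Interval]] = field(default_factory=dict)
    _reads: Optional[List[Interval]] = None

    def reads(self) -> List[Interval]:
        # Load read positions only when they are actually requested.
        if self._reads is None:
            self._reads = self.load_reads()
        return self._reads

    def compute_coverage(self, keep_reads: bool = False) -> List[int]:
        self.depth = coverage(self.reads(), self.length)
        if not keep_reads:
            self._reads = None  # drop positions; reloaded on demand later
        return self.depth


# Toy usage: the loader is a stand-in for reading an alignment file.
calls = []

def fake_loader() -> List[Interval]:
    calls.append(1)
    return [(1, 4), (3, 6), (5, 8)]

sample = ChipSeqSample(load_reads=fake_loader, length=10)
sample.compute_coverage()
sample.regions["peaks"] = regions_above(sample.depth, lower=2)
```

Externally defined regions such as promoters would go into the same
"regions" slot under another key, so peak-based and annotation-based
analyses share one container.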

The xcms package might be a good model here.


>   In the chipseq package, we basically gave up trying to formalize
> this, and made it free-for-all after the data reduction step. I'm not
> sure we can do better unless we restrict to specific pipelines.
>
> -Deepayan
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
