Re: [Bioc-sig-seq] Comparing two chipseq position sets

Wolfgang Huber Tue, 12 May 2009 06:34:34 -0700


Hi all,


this paper addresses a related question:

Non Parametric Methods for Genomic Inference
PJ Bickel, N Boley, JB Brown, H Huang, NR Zhang
http://encodestatistics.org/publications/statpaper-July31.pdf

I don't claim to understand it though. Probably you are well served byan ad hoc measure of (the distribution of) interval overlap(s).


        Best wishes
        Wolfgang

Ivan Gregoretti ha scritto:

Hello Nicolas and everybody,

I wanted to explore genomeIntervals before replying to the list. I have now.

Indeed, genomeIntervals is extremely powerful and it should make the
comparision between two sets less difficult.

I'll try to come up with a clean and simple comparison method first
and then move on towards addressing Steve G's questions. Sooner or
later chipseqers will have to address them anyway.

Thank you everybody,

Ivan




Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878



On Thu, May 7, 2009 at 11:29 AM, Nicolas Delhomme <[email protected]> wrote:

Hi Ivan,

I'm not yet so familiar with iRanges, just starting to use it. By the way, I
use the opportunity to thank the guys behing those libraries (ShortRead,
chipseq, iRanges, etc.) as they are doing a tremendous work. Chapeau bas
messieurs!
Back to your question, the fact that A and B have different size is not a
problem for genomeIntervals. The interval_overlap function will return a
list, which for every interval of A will have a vector of the position in B
of the corresponding overlapping interval.
I agree, this is confusing, so here is the bit of code applied to for your
example:

require(genomeIntervals)

A.set<-new("Genome_intervals",
               matrix(
                       c(3660781,3662707,
                       4481742,4482656,
                       4482813,4484003,
                       4561320,4562262,
                       4774887,4776304,
                       4797291,4798822,
                       4847807,4848846,
                       5008093,5009386,
                       5009514,5010046,
                       5010095,5010583),ncol=2,byrow=TRUE),
               closed=c(TRUE,TRUE),
               annotation=data.frame(
               seq_name=rep("chr1",10),
               inter_base=FALSE,
               strand="+"
               )
          )

B.set<-new("Genome_intervals",
               matrix(
                       c(
                        3659579,3662079,
                         4773791,4776291,
                          4797473,4799973,
                           4847394,4849894,
                            5007460,5009960,
                             5072753,5075253,
                              6204242,6206742,
                               7078730,7081230,
                                9282452,9284952,
                                 9683423,9685923),ncol=2,byrow=TRUE),
               closed=c(TRUE,TRUE),
               annotation=data.frame(
               seq_name=rep("chr1",10),
               inter_base=FALSE,
               strand="+"
               )
          )

A.B.overlap<-interval_overlap(A.set,B.set)

A.B.overlap

[[1]]
[1] 1

[[2]]
integer(0)

[[3]]
integer(0)

[[4]]
integer(0)

[[5]]
[1] 2

[[6]]
[1] 3

[[7]]
[1] 4

[[8]]
[1] 5

[[9]]
[1] 5

[[10]]
integer(0)

As you can see the first interval in A overlap with the first in B; the
fifth in A with the second in B and so on...

The comment from Steve are very good, I've read about it quite often in the
literature, especially for ChIP-chip but I don't know if there's a package
around. Let me know if you find/know about one.

Best,

---------------------------------------------------------------
Nicolas Delhomme

High Throughput Functional Genomics Center

European Molecular Biology Laboratory

Tel: +49 6221 387 8426
Email: [email protected]
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------



On 7 May 2009, at 16:02, Ivan Gregoretti wrote:

Hello Steve, Nicolas and Michael,

I agree with all of you: it is not a trivial question.

I asked the bioc-sig-seq listers because I thought, --Hey, this must
be the everyday's question of the genome analyst.

Say you ran your chipseq under condition A and then you ran it under
condition B. Then you have to decide whether A and B made any
difference. It doesn't get any simpler than that!

I can't compare the two means or the two dispersions. I have to
compare pairs. The problem is that it is not trivial to unambiguously
determine which spot in B must be paired with each spot in A. To start
with, A and B may have different numbers of loci (ie 15000 versus
18000).

I'll take a look at genomeIntervals and IRanges.

By the way, Michael, would you let me know as soon as the new IRanges
documentation comes out? You guys were working on something, I
understand.

Thank you all,

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878



On Thu, May 7, 2009 at 9:24 AM, Michael Lawrence <[email protected]>
wrote:


On Wed, May 6, 2009 at 12:40 PM, Ivan Gregoretti <[email protected]>
wrote:

Hello Bioc-sig-seq,

Say you run your ChIP-seq and find binding positions like this

chr1  3660781  3662707
chr1  4481742  4482656
chr1  4482813  4484003
chr1  4561320  4562262
chr1  4774887  4776304
chr1  4797291  4798822
chr1     4847807  4848846
chr1  5008093  5009386
chr1  5009514  5010046
chr1  5010095  5010583
...[many more loci and chromosomes]...

Then you want to compare it to published data like this

chr1  3659579  3662079
chr1  4773791  4776291
chr1  4797473  4799973
chr1  4847394  4849894
chr1  5007460  5009960
chr1  5072753  5075253
chr1  6204242  6206742
chr1  7078730  7081230
chr1  9282452  9284952
chr1  9683423  9685923
...[many more loci and chromosomes]...

What method would you use to test whether these two lists are
significantly different?

This is a tough statistical question that probably needs to be a bit more
specific, but as far as technical tools, in addition to genomeIntervals
there is the IRanges package and its efficient "overlap" function.
IRanges
is well integrated with the rest of sequence analysis infrastructure in
Bioconductor.

Any pointer would be appreciated.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



--

Best wishes
     Wolfgang

------------------------------------------------
Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Comparing two chipseq position sets

Reply via email to