Re: [Bioc-sig-seq] Comparing two chipseq position sets

Hervé Pagès Wed, 20 May 2009 17:01:02 -0700

Hi Ivan,

Ivan Gregoretti wrote:

Hello Nicolas and everybody,


Have you managed to get the output of the peak finder MACS into a
genomeIntervals object?


The output of MACS as seen through a data.frame object looks like this

head(A, 10)

    chr   start     end length summit tags Tpvalue fold_enrichment  FDR
1  chr1 3660782 3662707   1926    749   64  492.04           38.44 0.11
2  chr1 4481743 4482656    914    179   14   83.64           10.01 0.88
3  chr1 4482814 4484003   1190    671   18  108.48           16.43 0.17
4  chr1 4561321 4562262    942    472   13  136.26           30.96 0.04
5  chr1 4774888 4776304   1417    781   60  674.41           78.07 0.15
6  chr1 4797292 4798822   1531    813   56  382.46           42.73 0.06
7  chr1 4847808 4848846   1039    533   26  325.36           82.17 0.05
8  chr1 5008094 5009386   1293    973   20  117.98           13.28 0.08
9  chr1 5009515 5010046    532    179   10   94.86           24.65 0.38
10 chr1 5010096 5010583    488    170    6   52.31           20.54 5.39


My attempt to make it a genomeIntervals object goes as far as this:

giA <- new("Genome_intervals", A[ ,2:3], closed=c(TRUE,TRUE),
annotation=data.frame(A[ ,1], inter_base=FALSE, strand="+", A[ ,4:9]))

The operation reports no errors but it doesn't really work because I can't
use it. For instance, I can't make a simple call to the first three elements
like this

giA[1:3]

Error in .nextMethod(x, i, j, ..., drop) : subscript out of bounds


Because your giA object is broken (have you tried validObject on it?
or to just display it?)

Try:

  giA_annotation <- data.frame(seq_name=A[ ,1], # note the required named field!
                               inter_base=FALSE,
                               strand="+",
                               A[ ,4:9])

  giA <- new("Genome_intervals",
             as.matrix(A[ ,2:3]),
             closed=c(TRUE,TRUE),
             annotation=giA_annotation)

  > validObject(giA)
  [1] TRUE

Maybe the authors of the genomeIntervals package should not let you
build broken Genome_intervals instances so easily...

Cheers,
H.


Any insights?

Thank you,

Ivan



Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878


On Thu, May 7, 2009 at 11:29 AM, Nicolas Delhomme <[email protected]> wrote:

Hi Ivan,

I'm not yet so familiar with iRanges, just starting to use it. By the way,
I use the opportunity to thank the guys behing those libraries (ShortRead,
chipseq, iRanges, etc.) as they are doing a tremendous work. Chapeau bas
messieurs!
Back to your question, the fact that A and B have different size is not a
problem for genomeIntervals. The interval_overlap function will return a
list, which for every interval of A will have a vector of the position in B
of the corresponding overlapping interval.
I agree, this is confusing, so here is the bit of code applied to for your
example:

require(genomeIntervals)

A.set<-new("Genome_intervals",
               matrix(
                       c(3660781,3662707,
                       4481742,4482656,
                       4482813,4484003,
                       4561320,4562262,
                       4774887,4776304,
                       4797291,4798822,
                       4847807,4848846,
                       5008093,5009386,
                       5009514,5010046,
                       5010095,5010583),ncol=2,byrow=TRUE),
               closed=c(TRUE,TRUE),
               annotation=data.frame(
               seq_name=rep("chr1",10),
               inter_base=FALSE,
               strand="+"
               )
          )

B.set<-new("Genome_intervals",
               matrix(
                       c(
                        3659579,3662079,
                         4773791,4776291,
                          4797473,4799973,
                           4847394,4849894,
                            5007460,5009960,
                             5072753,5075253,
                              6204242,6206742,
                               7078730,7081230,
                                9282452,9284952,
                                 9683423,9685923),ncol=2,byrow=TRUE),
               closed=c(TRUE,TRUE),
               annotation=data.frame(
               seq_name=rep("chr1",10),
               inter_base=FALSE,
               strand="+"
               )
          )

A.B.overlap<-interval_overlap(A.set,B.set)

A.B.overlap

[[1]]
[1] 1

[[2]]
integer(0)

[[3]]
integer(0)

[[4]]
integer(0)

[[5]]
[1] 2

[[6]]
[1] 3

[[7]]
[1] 4

[[8]]
[1] 5

[[9]]
[1] 5

[[10]]
integer(0)

As you can see the first interval in A overlap with the first in B; the
fifth in A with the second in B and so on...

The comment from Steve are very good, I've read about it quite often in the
literature, especially for ChIP-chip but I don't know if there's a package
around. Let me know if you find/know about one.

Best,

---------------------------------------------------------------
Nicolas Delhomme

High Throughput Functional Genomics Center

European Molecular Biology Laboratory

Tel: +49 6221 387 8426
Email: [email protected]
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------



On 7 May 2009, at 16:02, Ivan Gregoretti wrote:

 Hello Steve, Nicolas and Michael,

I agree with all of you: it is not a trivial question.

I asked the bioc-sig-seq listers because I thought, --Hey, this must
be the everyday's question of the genome analyst.

Say you ran your chipseq under condition A and then you ran it under
condition B. Then you have to decide whether A and B made any
difference. It doesn't get any simpler than that!

I can't compare the two means or the two dispersions. I have to
compare pairs. The problem is that it is not trivial to unambiguously
determine which spot in B must be paired with each spot in A. To start
with, A and B may have different numbers of loci (ie 15000 versus
18000).

I'll take a look at genomeIntervals and IRanges.

By the way, Michael, would you let me know as soon as the new IRanges
documentation comes out? You guys were working on something, I
understand.

Thank you all,

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878



On Thu, May 7, 2009 at 9:24 AM, Michael Lawrence <[email protected]>
wrote:


On Wed, May 6, 2009 at 12:40 PM, Ivan Gregoretti <[email protected]>
wrote:

Hello Bioc-sig-seq,

Say you run your ChIP-seq and find binding positions like this

chr1  3660781  3662707
chr1  4481742  4482656
chr1  4482813  4484003
chr1  4561320  4562262
chr1  4774887  4776304
chr1  4797291  4798822
chr1     4847807  4848846
chr1  5008093  5009386
chr1  5009514  5010046
chr1  5010095  5010583
...[many more loci and chromosomes]...

Then you want to compare it to published data like this

chr1  3659579  3662079
chr1  4773791  4776291
chr1  4797473  4799973
chr1  4847394  4849894
chr1  5007460  5009960
chr1  5072753  5075253
chr1  6204242  6206742
chr1  7078730  7081230
chr1  9282452  9284952
chr1  9683423  9685923
...[many more loci and chromosomes]...

What method would you use to test whether these two lists are
significantly different?

This is a tough statistical question that probably needs to be a bit more
specific, but as far as technical tools, in addition to genomeIntervals
there is the IRanges package and its efficient "overlap" function.
IRanges
is well integrated with the rest of sequence analysis infrastructure in
Bioconductor.

Any pointer would be appreciated.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Comparing two chipseq position sets

Reply via email to