Re: [Bioc-sig-seq] Implementing workflow.pdf (BioC2010 lab handout)

Michael Lawrence Mon, 16 Aug 2010 14:55:02 -0700

On Mon, Aug 16, 2010 at 12:44 PM, [email protected] <
[email protected]> wrote:


>  Dear bioc-sig-sequencing,
>
> I am trying to follow a BioC2010 lab handout (
> http://www.bioconductor.org/help/course-materials/2010/BioC2010/Workflow.pdf)
> to find differential peaks in a chipseq experiment.  The "unevaluated code
> block" on page one of the handout is,
>
> > filter <- compose(chipseqFilter(), alignQualityFilter(15))
> > cstest <- seqapply(sampleFiles, function(file) {
> + as(readAligned(file, filter), "GRanges")
> + })
>
> To try to implement this,
>
> > dataDir <- "/home/mterry/ceru_data"
> > dataDir
> [1] "/home/mterry/ceru_data"
>
> > pattern <- "s_[8I]_export_chr1.txt"
> > list.files(dataDir, pattern)
> [1] "s_8_export_chr1.txt" "s_I_export_chr1.txt"
>
> Two Eland aligned files selected (analogous to cstest$ctcf & cstest$gfp in
> handout).
>
> > filt1 <- alignDataFilter(expression(filtering=="Y"))
> > filt2 <- chromosomeFilter("chr[0-9XYM]+.fas")
> > filt3 <- occurrenceFilter(withSread = FALSE)
> > filt <- compose(filt1, filt2, filt3)
>
>
Comments below, but btw, the chipseqFilter() command makes this easier, with
the second two filters (and maybe even the first) combined into:
chipseqFilter(exclude="_"). Calling chipseqFilter() with no arguments will
also exclude the sex and mitochondrial chromosomes.

> arabtest <- seqapply(dataDir, function(file) {
> +   as(readAligned(file, pattern, type="SolexaExport", filter=filt),
> "GRanges")
> + })
> > arabtest
> GRangesList of length 1
> [[1]]
> GRanges with 1989354 ranges and 7 elementMetadata values
>          seqnames               ranges strand   |      run      lane
>  tile
>             <Rle>            <IRanges>  <Rle>   | <factor> <integer>
> <integer>
>      [1] chr1.fas [ 7568294,  7568328]      +   |        9         8
>   1
>      [2] chr1.fas [  167488,   167522]      +   |        9         8
>   1
>
> > sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-08-02 r52661)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>  [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
>  [7] LC_PAPER=en_US.utf8       LC_NAME=C
>  [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.1.6 chipseq_0.99.1        ShortRead_1.7.11
> [4] Rsamtools_1.1.11      lattice_0.18-8        BSgenome_1.17.6
> [7] Biostrings_2.17.27    GenomicRanges_1.1.20  IRanges_1.7.15
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0     biomaRt_2.5.1     DBI_0.2-5         grid_2.12.0
> [5] hwriter_1.2       RCurl_1.4-3       RSQLite_0.9-2     rtracklayer_1.9.4
> [9] XML_3.1-0
> >
>
>
> My questions/observations:
>
> 1. I was expecting "GRangesList of length 2", not "1"?
>

You are looping over the directory, not the files that match the pattern. It
appears you've made the extremely common mistake of assuming that
readAligned returns a separate object for each file matching the pattern. As
its name suggests, it returns an AlignedRead object, which does not
distinguish between input files (everything is merged).

Again, this is a common mistake, and it would be great if ShortRead had some
way of more easily achieving what you want to do here (a very common use
case).


> 2. From separate runs, when selecting only one of the Eland aligned files
> at a time, the "1989354" ranges found here (from the "two file pattern")
>  are nearly (but not exactly) the sum of the ranges for the two files.
>
>
3. What should I have done to get the expected "GRangesList of length 2"
> with ranges from each input file in separate elements of the GRangesList
> object?
>
>
Loop over the output of your list.files() command instead, except you'll
need full.names=TRUE.


> 4. An ancillary question, how can I change the designation to a name of my
> choosing for the GRangeslist elements (for example, the lab handout has
> "$ctcf", while I have "[[1]]")?
>
>
seqapply(), like lapply(), will use the names from the input vector. So just
name your filename vector.


>
> Thanks,
> P. Terry
> [email protected]
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Implementing workflow.pdf (BioC2010 lab handout)

Reply via email to