Sumit,
The ShortRead package uses a convention in which filters are used to weed out unwanted data. One of these filters is polynFilter, which filters out reads containing excessive amounts of the selected nucleotides. There is an unfortunate bug in polynFilter when only one nucleotide type is chosen, but I just fixed it in the svn repository and the fix will become available on bioconductor.org in a day or so. Here is an example that uses polynFilter to filter out reads with 32 or more A's (this operation filtered out 2 reads from the example data):

> suppressMessages(library(ShortRead))
> sp <- SolexaPath(system.file("extdata", package="ShortRead"))
> aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example
> polyAFilt <- polynFilter(threshold = 32, nuc = "A")
> aln
class: AlignedRead
length: 1000 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
> aln[polyAFilt(aln)]
class: AlignedRead
length: 998 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
> sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-23 r47990)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ShortRead_1.1.44 lattice_0.17-20 BSgenome_1.11.11 Biostrings_2.11.39
[5] IRanges_1.1.47

loaded via a namespace (and not attached):
[1] Biobase_2.3.10 grid_2.9.0 hwriter_1.1
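
For intuition only, the count-based rule described above can be sketched in base R without ShortRead. This is an illustration, not the package's implementation: it assumes a read is kept when it has fewer than `threshold` occurrences of the chosen nucleotide, and the helper name `keepByNucCount` and the toy reads are made up for this example.

```r
# Hypothetical helper (not part of ShortRead): returns TRUE for reads to keep.
# Assumption: a read passes when it contains fewer than `threshold` copies
# of the nucleotide given in `nuc`.
keepByNucCount <- function(reads, threshold, nuc = "A") {
  # Delete every character that is not `nuc`, then count what remains.
  others <- paste0("[^", nuc, "]")
  counts <- nchar(gsub(others, "", reads))
  counts < threshold
}

# Toy 35-mers and one ordinary read, invented for the illustration.
reads <- c(strrep("A", 35),                      # 35 A's
           paste0(strrep("A", 31), "CGTC"),      # 31 A's
           "AATAAAGTGCTTACAGTGCCGTTAGCATCAATACCG")
keep <- keepByNucCount(reads, threshold = 32)
reads[keep]
```

The logical vector is used for subsetting in the same way as `aln[polyAFilt(aln)]` in the session above; with real data the ShortRead filter is the better choice.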


Patrick


Middha, Sumit wrote:
Hi,

I am writing to check whether there is a usable poly-A removal function
for removing reads in which all bases are A's. From what I understand,
this happens because of a constant intensity originating from a spec or
from the edges of the lane.

I will search for this myself, but I am also looking for a start-up set
of commands that load the requisite libraries along with ShortRead so
that I can get started on this analysis.

Cheers,
Sumit

-----Original Message-----
From: bioc-sig-sequencing-boun...@r-project.org
[mailto:bioc-sig-sequencing-boun...@r-project.org] On Behalf Of Cei
Abreu-Goodger
Sent: Sunday, February 22, 2009 6:23 PM
To: bioc-sig-sequencing@r-project.org
Subject: [Bioc-sig-seq] Low-complexity read filtering/trimming

Hi all,

I've been playing around with some Solexa small-RNA reads using ShortRead and Biostrings. I've used the 'trimLRPatterns' function to remove adapter sequence, and I've been trying to remove low-complexity sequences with 'srFilter'. I would first really like to congratulate all the people involved for the great work. There are two situations in which I would be grateful for some suggestions, though:

1) I have many "low-complexity" reads. Some are simply polyA, polyC, etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously I would have used "dust" on the command line to filter out this kind of read in a fasta file. Any ideas on how to achieve similar functionality in the ShortRead world?

2) For some reads I may have a "N-rich" patch inside the read, for
example:
AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG

I would ideally like to trim off everything starting at the "N-rich" part. I was trying to implement something with 'vmatchPattern', but if I allow for mismatches (for a more flexible search) I will also get hits starting before the run of Ns.

Many thanks,

Cei



sessionInfo()

R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
i386-apple-darwin9.6.0

locale:
C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] ShortRead_1.1.39 lattice_0.17-20 BSgenome_1.11.9 Biostrings_2.11.28
[5] IRanges_1.1.38     Biobase_2.3.10

loaded via a namespace (and not attached):
[1] Matrix_0.999375-20 grid_2.9.0




_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
