Sumit,
The ShortRead package uses a convention in which filters weed out
unwanted data. One of these filters is polynFilter, which removes reads
containing an excessive number of the selected nucleotides. There is an
unfortunate bug in polynFilter when only one nucleotide type is chosen,
but I just fixed it in the svn repository and the fix will become
available on bioconductor.org in a day or so. Here is an example that
uses polynFilter to filter out reads with 32 or more A's (this operation
removed 2 reads from the example data):
> suppressMessages(library(ShortRead))
> sp <- SolexaPath(system.file("extdata", package="ShortRead"))
> aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example
> polyAFilt <- polynFilter(threshold = 32, nuc = "A")
> aln
class: AlignedRead
length: 1000 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
> aln[polyAFilt(aln)]
class: AlignedRead
length: 998 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
> sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-23 r47990)
i386-apple-darwin9.6.0
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.1.44 lattice_0.17-20 BSgenome_1.11.11 Biostrings_2.11.39
[5] IRanges_1.1.47
loaded via a namespace (and not attached):
[1] Biobase_2.3.10 grid_2.9.0 hwriter_1.1
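For anyone who wants to see the rule itself outside of R, here is a minimal Python sketch of the selection logic described above: a read is kept only if it has fewer than `threshold` copies of each chosen nucleotide. It is an illustration under that assumption, not ShortRead's implementation; the function name `polyn_keep` and the toy reads are made up for the example.

```python
from collections import Counter

def polyn_keep(reads, threshold, nuc="A"):
    """Return one boolean per read: True if the read has fewer than
    `threshold` copies of every nucleotide listed in `nuc`."""
    keep = []
    for read in reads:
        counts = Counter(read.upper())
        keep.append(all(counts[n] < threshold for n in nuc))
    return keep

# Four made-up 35-base reads: two are dominated by A, two are not.
reads = ["A" * 35, "A" * 32 + "CGT", "ACGT" * 8 + "ACG", "A" * 31 + "CCGG"]
mask = polyn_keep(reads, threshold=32, nuc="A")

# Subset the reads with the boolean mask, as aln[polyAFilt(aln)] does in R.
kept = [r for r, k in zip(reads, mask) if k]
```

The mask plays the same role as the logical vector used to subset `aln` in the transcript above.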
Patrick
Middha, Sumit wrote:
Hi,
I was writing to check if there is a usable poly-A removal function to
remove the poly-A reads, where all bases are A's. From what I understand,
this happens because of a constant intensity originating from a speck or
the edges of the lane.
I will search for the same, but I am also looking for a start-up set of
commands to load the requisite libraries along with ShortRead to get
started on this analysis.
Cheers,
Sumit
-----Original Message-----
From: bioc-sig-sequencing-boun...@r-project.org
[mailto:bioc-sig-sequencing-boun...@r-project.org] On Behalf Of Cei
Abreu-Goodger
Sent: Sunday, February 22, 2009 6:23 PM
To: bioc-sig-sequencing@r-project.org
Subject: [Bioc-sig-seq] Low-complexity read filtering/trimming
Hi all,
I've been playing around with some Solexa small-RNA reads using
ShortRead and Biostrings. I've used the 'trimLRPatterns' function to
remove adapter sequence, and I've been trying to remove low-complexity
sequences with 'srFilter'. I would first really like to congratulate all
the people involved for the great work. There are two situations in
which I would be grateful for some suggestions, though:
1) I have many "low-complexity" reads. Some are simply polyA, polyC,
etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously
I would have used "dust" on the command line to filter out this kind of
read in a fasta file. Any ideas on how to achieve similar functionality
in the ShortRead world?
2) For some reads I may have a "N-rich" patch inside the read, for
example:
AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG
I would ideally like to trim off everything starting at the "N-rich"
part. I was trying to implement something with 'vmatchPattern', but if I
allow for mismatches (for a more flexible search) I also get hits
starting before the run of Ns.
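
To make the two requests above concrete, here is a rough Python sketch of one way each could be approached. Neither function is part of ShortRead or Biostrings, and both are assumptions for illustration: `dust_like_score` is a simplified, whole-read variant of a DUST-style triplet score (no windowing or masking), and `trim_at_n_rich` cuts a read at the first N inside the first window that holds at least `min_n` Ns. The names, the window width of 5 and the cutoff of 3 are all hypothetical choices.

```python
from collections import Counter

def dust_like_score(read, k=3):
    """Simplified DUST-style score: repeated k-mers push the score up.
    Sum c*(c-1)/2 over the k-mer counts, divided by (number of k-mers - 1)."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    counts = Counter(kmers)
    return sum(c * (c - 1) / 2 for c in counts.values()) / (len(kmers) - 1)

def trim_at_n_rich(read, window=5, min_n=3):
    """Drop everything from the first N of the first window that contains
    at least `min_n` Ns; isolated Ns below that density are left alone."""
    for start in range(max(len(read) - window + 1, 1)):
        chunk = read[start:start + window]
        if chunk.count("N") >= min_n:
            return read[:start + chunk.index("N")]
    return read

poly_a = dust_like_score("A" * 12)           # homopolymer
dinuc = dust_like_score("ATATATATATAT")      # dinucleotide repeat
mixed = dust_like_score("AATAAAGTGCTTACAGTG")

trimmed = trim_at_n_rich("AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG")
```

Because the trim is driven by N density in a window rather than by approximate pattern matching, it cannot produce a hit that starts before the run of Ns. A score cutoff for the first function would have to be tuned on real data.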
Many thanks,
Cei
sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
i386-apple-darwin9.6.0
locale:
C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ShortRead_1.1.39 lattice_0.17-20 BSgenome_1.11.9 Biostrings_2.11.28
[5] IRanges_1.1.38 Biobase_2.3.10
loaded via a namespace (and not attached):
[1] Matrix_0.999375-20 grid_2.9.0
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing