Re: [Bioc-sig-seq] Low-complexity read filtering/trimming [PolyA removal]

Patrick Aboyoun Thu, 12 Mar 2009 10:50:53 -0700

Sumit,

The ShortRead package uses a convention where filters can be used to
weed out unwanted data. One of the filters is polynFilter, which
filters out reads with excessive amounts of the selected nucleotides.
There is an unfortunate bug in polynFilter when only one nucleotide type
is chosen, but I just fixed it in the svn repository and the fix will
become available on bioconductor.org in a day or so. Here is an example of
filtering out reads with 32 or more A's using the polynFilter
function (this operation filtered out 2 reads from the example data):


> suppressMessages(library(ShortRead))
> sp <- SolexaPath(system.file("extdata", package="ShortRead"))
> aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example
> polyAFilt <- polynFilter(threshold = 32, nuc = "A")
> aln
class: AlignedRead
length: 1000 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
> aln[polyAFilt(aln)]
class: AlignedRead
length: 998 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
> sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-23 r47990)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] stats graphics grDevices utils datasets methods base

other attached packages:

[1] ShortRead_1.1.44   lattice_0.17-20    BSgenome_1.11.11   Biostrings_2.11.39
[5] IRanges_1.1.47

loaded via a namespace (and not attached):

[1] Biobase_2.3.10 grid_2.9.0 hwriter_1.1
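
To make the threshold rule concrete, here is a minimal base-R sketch of the
counting logic only. It is not the ShortRead implementation:
keep_below_threshold and the toy reads are made up for illustration, and it
assumes a read is dropped when it contains `threshold` or more copies of the
chosen nucleotide, as in the example above.

```r
# Hypothetical helper (not part of ShortRead): TRUE for reads to keep,
# i.e. reads with fewer than `threshold` copies of the single letter `nuc`.
keep_below_threshold <- function(reads, nuc = "A", threshold = 32L) {
  # Count `nuc` per read as: full length minus length with `nuc` removed.
  counts <- nchar(reads) - nchar(gsub(nuc, "", reads, fixed = TRUE))
  counts < threshold
}

# Made-up 35-cycle reads with 35, 32, 31 and 9 A's respectively.
reads <- c(
  strrep("A", 35),
  paste0(strrep("A", 32), "CGT"),
  paste0(strrep("A", 31), "CGTC"),
  "ACGTACGTACGTACGTACGTACGTACGTACGTACG"
)

keep <- keep_below_threshold(reads, nuc = "A", threshold = 32L)
reads[keep]  # only the last two reads remain
```

The logical vector plays the same role as the result of polyAFilt(aln) above:
it is used to subset the reads.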



Patrick


Middha, Sumit wrote:

Hi,

I am writing to check whether there is a usable poly-A removal function to
remove the reads where all bases are A's. From what I understand,
this happens because of a constant intensity originating from a speck or
the edges of the lane.

I will search for the same, but I am also looking for a start-up set of
commands to load the requisite libraries along with ShortRead to get
started on this analysis.

Cheers,
Sumit

-----Original Message-----
From: bioc-sig-sequencing-boun...@r-project.org
[mailto:bioc-sig-sequencing-boun...@r-project.org] On Behalf Of Cei
Abreu-Goodger
Sent: Sunday, February 22, 2009 6:23 PM
To: bioc-sig-sequencing@r-project.org
Subject: [Bioc-sig-seq] Low-complexity read filtering/trimming

Hi all,
I've been playing around with some Solexa small-RNA reads using
ShortRead and Biostrings. I've used the 'trimLRPatterns' function to
remove adapter sequence, and I've been trying to remove low-complexity
sequences with 'srFilter'. I would first really like to congratulate all
the people involved for the great work. There are two situations in
which I would be grateful for some suggestions, though:
1) I have many "low-complexity" reads. Some are simply polyA, polyC,
etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously
I would have used "dust" on the command line to filter out this kind of
read in a fasta file. Any ideas on how to achieve similar functionality
in the ShortRead world?
2) For some reads I may have a "N-rich" patch inside the read, for
example:
AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG
I would ideally like to trim off everything starting at the "N-rich"
part. I was trying to implement something with 'vmatchPattern', but if I
allow for mismatches (for a more flexible search) I will also get hits
starting before the run of Ns.
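
One possible starting point for (2), sketched in base R rather than with
'vmatchPattern': treat the first run of two or more consecutive N's as the
start of the "N-rich" patch and drop everything from there. trim_at_n_run and
the min_run cutoff are assumptions made for illustration, not ShortRead or
Biostrings functions, and a run-length rule is only one way to define "N-rich".

```r
# Hypothetical helper: cut each read just before its first run of
# at least `min_run` consecutive N's; reads without such a run are unchanged.
trim_at_n_run <- function(reads, min_run = 2L) {
  pattern <- sprintf("N{%d,}", min_run)
  # regexpr gives the start of the first match per read, or -1 if none.
  start <- as.integer(regexpr(pattern, reads))
  hit <- start > 0
  trimmed <- reads
  trimmed[hit] <- substr(reads[hit], 1, start[hit] - 1)
  trimmed
}

reads <- c("AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG",  # read from the example above
           "ACGTNACGT")                              # made-up read, isolated N only
trimmed <- trim_at_n_run(reads)
trimmed
# [1] "AATAAAGTGCTTACAGTG" "ACGTNACGT"
```

Because the match is anchored on the Ns themselves, nothing before the run is
removed, which avoids the early hits seen when allowing mismatches.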
Many thanks,

Cei



sessionInfo()

R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
i386-apple-darwin9.6.0

locale:
C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] ShortRead_1.1.39   lattice_0.17-20    BSgenome_1.11.9    Biostrings_2.11.28
[5] IRanges_1.1.38     Biobase_2.3.10

loaded via a namespace (and not attached):
[1] Matrix_0.999375-20 grid_2.9.0


_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
