On 11/15/2010 06:51 PM, Janet Young wrote:
Hi again,
I'm interested in some sequences on the cow chrUn scaffolds, and am
having a bit of bother getting them. I think I might have
uncovered a
bug, although I might just be doing something wrong. The code and
output
below should explain all. Any suggestions?
thanks (again!),
Oops, you found another one. More below...
Janet Young
-------------------------------------------------------------------
Dr. Janet Young (Trask lab)
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., C3-168,
P.O. Box 19024, Seattle, WA 98109-1024, USA.
tel: (206) 667 1471 fax: (206) 667 6524
email: jayoung ...at... fhcrc.org
http://www.fhcrc.org/labs/trask/
-------------------------------------------------------------------
library(BSgenome.Btaurus.UCSC.bosTau4)
###### this works, gives me a 50bp sequence
getSeq(Btaurus,"chr1",start=1,end=50)
[1] "TACCCCACTCACACTTATGGATAGATCAACTAAACAGAAAATTAACAAGG"
####### for some scaffolds in the chrUn.scaffolds pile, I don't
get an
error message, but getSeqs seems to ignore the start and end
coordinates
requested - e.g. the sequence returned here is the whole scaffold,
not
just the first 50bp
getSeq(Btaurus,"chrUn.004.11829",start=1,end=50)
[1]
"TCATGTGTTTCTTCCAGTCCAGCATTTCTCATGATGTACTCTGCATATAAGTTAAATAAACAGGGTG
ACAATATACAGCCTTGATGAACTCCTTTTCCTATTTGGAACCAGTCTGTTGTTCCATGTCCAGTTCTAACTGTTGCTTCCTGACCTGCATACAGATTTCTCAAGAGGCAGATCAGGTGTTCTCATCTCCTGAGAATTGAAGGTACAAATTGTAGTGTTTCAATTGGCACCATGCTAATTTATCTTGGCCTAAAATAGTGAATGGGCTTCCCTGGTGGCTCAGGTGGTAAAGAATCTGCCTGCAATGCTGGAGACCTGGGTTCAATATCTGGGTTGGGAAGATTACCTGGAGGAGGGCATGGAGGCTTACTCGAATATTCTTGCCTGGAAAATCTCCATGGACAGAGAAGCTGGGTGGGTTACTGTCCATGGGGTCGCAAAGAGTCAGACGTGACTGAGCAACTAAGCACAGCACAACACAAAATAGTGAATACTGAGCAAGTAAAGGAAAAACCTCTTCCTCTCAGAAATTGGTCTTCATTTTTTCATGAGAATTGCTAGTCTTCCTCCCAAAGCCAAAACCATAAATTTGTTAGTGTTTGACCTCAATATATTTTCTCTTAACTCAGCTTTTAAACCTTCTCTGCCTCCTGCTACCATTCACTTTCTAGTACATTTGAAATCTGTCCAAGCCATTCCTGGGGTTCAGGTGTCTGAGACCTGATTTATTTCATTGATATATTAAAACACCCTTGAATCCAGCCAACGTATGTGGCCAGTTTTACTTGCTTTGCTCCCATACTGGTAATGGAATTTTTATGGCTGTAAAATATCTGGGTCATGTGGCATTTTCATCTTCTGTTGTCTTGAGCTGGTATAGTTTTACCAACGTGCCATTAAGGGATGGTTCCTTTACCATCATTGTGCTTCCTGGGGCCTTGCCCACTTTGCACTGTAAGTCAGAACAAGAGACCCTCCAAG
TATTTAATTTCC"
Yes, the new code was ignoring the 'end'. This is now fixed.
#### for other scaffolds I just get an error message, although the
named
scaffold definitely exists (is it doing a partial match on the
name, not
an exact match?)
Yes it's doing partial matching (it uses grep()). This feature was
added a long time ago on a user's request. Note that the BSgenome
package for Cow is peculiar with its chrUn.scaffolds multiple
sequence
object. Multiple sequences objects in the BSgenome packages are
generally the upstream* objects which contain long/ugly sequence
names like
> names(Btaurus$upstream1000)[1:5]
[1] "BC118489_up_1000_chr1_142608595_r chr1:142608595-142609594"
[2] "BC151644_up_1000_chr1_44930841_f chr1:44930841-44931840"
[3] "BC116005_up_1000_chr1_46052279_f chr1:46052279-46053278"
[4] "BC133294_up_1000_chr1_44020590_f chr1:44020590-44021589"
[5] "BC109604_up_1000_chr1_65024340_r chr1:65024340-65025339"
so it seemed convenient to be able to specify only part of a name,
at least when working interactively.
Currently Cow and Zebrafish are the only BSgenome packages with
multiple sequences objects other than the upstream[125]000 objects.
Thanks to your feedback, it is now clear that the "grep feature" is
not always desirable. Here are 2 workarounds:
1) Modify the regular expression so that it will match the whole
thing:
> getSeq(Btaurus,"^chrUn.004.1022$", as.character=FALSE)
54139-letter "DNAString" instance
seq:
AAATCTAGAGTAGAGTTTGGAATTTCAGATATACAA
...GTGCTGCAATTCATGGGGTCACAAAGAGTCAGACAT
('as.character=FALSE' used here only to produce a compact output)
Note that this is not completely safe because the dots in
"chrUn.004.1022" could match any character. Using "^chrUn\\.004\\.
1022$"
would be safer. But, performance aside, having to escape all special
characters in the regex just to have it behave like if fixed=TRUE
was passed to the call to grep() doesn't sound very appealing.
2) So here is another one. With BSgenome 1.18.2, the "grep feature"
will be disable when you use a high level object like GRanges or
RangedData. So you will be able to do something like:
> getSeq(Btaurus, GRanges("chrUn.004.1022", IRanges(start=1,
end=10)), as.character=FALSE)
A DNAStringSet instance of length 1
width seq
[1] 10 AAATCTAGAG
But this workaround has its issues too (e.g. if I want to get the
entire sequence I need to know its end in advance).
OK none of those workarounds is totally satisfying so I'll try to
come up with something better for BSgenome 1.18.3 (maybe add an
'exact.match' argument to getSeq() that would be FALSE by default?)
Suggestions are welcome.
Thanks for your feedback,
H.
getSeq(Btaurus,"chrUn.004.1022")
Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i],
start[i], :
sequence chrUn.004.1022 found more than once, please use a non-
ambiguous
name
which ( names(Btaurus$chrUn.scaffolds) == "chrUn.004.1022" )
[1] 1022
grep ( "chrUn.004.1022" , names(Btaurus$chrUn.scaffolds) )
[1] 1022 10220 10221 10222 10223 10224 10225 10226 10227 10228 10229
sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Btaurus.UCSC.bosTau4_1.3.16 BSgenome_1.18.1
[3] Biostrings_2.18.0 GenomicRanges_1.2.1
[5] IRanges_1.8.2
loaded via a namespace (and not attached):
[1] Biobase_2.10.0
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: [email protected]
Phone: (206) 667-5791
Fax: (206) 667-1319