Re: [Bioc-sig-seq] getSeq and Btaurus$chrUn.scaffolds - ambiguous name error

Janet Young Wed, 24 Nov 2010 16:01:59 -0800

Hi,

Looks like BSgenome 1.18.2 went live a couple of hours ago - all looksgood for me now.


Thanks very much,

Janet



On Nov 23, 2010, at 1:46 PM, Janet Young wrote:

Hi Herve,
Thanks very much for this fix, and the other one from the same day(the one about space names as factors vs characters in getSeq).
I haven't been able to test it yet with my real data, as I'm notseeing BSgenome 1.18.2 up yet on the bioc website (I'm usingrelease, not dev), and I'm also not yet getting it when I useupdate.packages - should it be up already? If there's some kind oftimelag I should wait for, that's fine, but I'd thought things wereusually pretty instantaneous (or perhaps overnight), so thought I'dbetter check back with you whether the new version somehow failed topost.
I can understand the problem with getSeq - for me, I think yourworkaround #2 will be totally fine (I'd actually started with seqsspecified in RangedData objects, including start and end coords - Ionly started using names without coords to isolate the problem).
The option to disable or enable the grep feature sounds like a goodfix. For me, it seems more natural that the default should be todisable grep, but perhaps there too many other users who've got usedto expecting the grep. Either way, it'd be great to explain it inthe getSeq help page - I imagine that'll be the case in 1.18.2.
Janet




On Nov 18, 2010, at 10:15 PM, Hervé Pagès wrote:
On 11/15/2010 06:51 PM, Janet Young wrote:
Hi again,

I'm interested in some sequences on the cow chrUn scaffolds, and am
having a bit of bother getting them. I think I might haveuncovered abug, although I might just be doing something wrong. The code andoutput
below should explain all. Any suggestions?

thanks (again!),
Oops, you found another one. More below...
Janet Young

-------------------------------------------------------------------

Dr. Janet Young (Trask lab)

Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., C3-168,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 1471 fax: (206) 667 6524
email: jayoung ...at... fhcrc.org

http://www.fhcrc.org/labs/trask/

-------------------------------------------------------------------


library(BSgenome.Btaurus.UCSC.bosTau4)

###### this works, gives me a 50bp sequence
getSeq(Btaurus,"chr1",start=1,end=50)
[1] "TACCCCACTCACACTTATGGATAGATCAACTAAACAGAAAATTAACAAGG"
####### for some scaffolds in the chrUn.scaffolds pile, I don'tget anerror message, but getSeqs seems to ignore the start and endcoordinatesrequested - e.g. the sequence returned here is the whole scaffold,not
just the first 50bp
getSeq(Btaurus,"chrUn.004.11829",start=1,end=50)
[1]
"TCATGTGTTTCTTCCAGTCCAGCATTTCTCATGATGTACTCTGCATATAAGTTAAATAAACAGGGTGACAATATACAGCCTTGATGAACTCCTTTTCCTATTTGGAACCAGTCTGTTGTTCCATGTCCAGTTCTAACTGTTGCTTCCTGACCTGCATACAGATTTCTCAAGAGGCAGATCAGGTGTTCTCATCTCCTGAGAATTGAAGGTACAAATTGTAGTGTTTCAATTGGCACCATGCTAATTTATCTTGGCCTAAAATAGTGAATGGGCTTCCCTGGTGGCTCAGGTGGTAAAGAATCTGCCTGCAATGCTGGAGACCTGGGTTCAATATCTGGGTTGGGAAGATTACCTGGAGGAGGGCATGGAGGCTTACTCGAATATTCTTGCCTGGAAAATCTCCATGGACAGAGAAGCTGGGTGGGTTACTGTCCATGGGGTCGCAAAGAGTCAGACGTGACTGAGCAACTAAGCACAGCACAACACAAAATAGTGAATACTGAGCAAGTAAAGGAAAAACCTCTTCCTCTCAGAAATTGGTCTTCATTTTTTCATGAGAATTGCTAGTCTTCCTCCCAAAGCCAAAACCATAAATTTGTTAGTGTTTGACCTCAATATATTTTCTCTTAACTCAGCTTTTAAACCTTCTCTGCCTCCTGCTACCATTCACTTTCTAGTACATTTGAAATCTGTCCAAGCCATTCCTGGGGTTCAGGTGTCTGAGACCTGATTTATTTCATTGATATATTAAAACACCCTTGAATCCAGCCAACGTATGTGGCCAGTTTTACTTGCTTTGCTCCCATACTGGTAATGGAATTTTTATGGCTGTAAAATATCTGGGTCATGTGGCATTTTCATCTTCTGTTGTCTTGAGCTGGTATAGTTTTACCAACGTGCCATTAAGGGATGGTTCCTTTACCATCATTGTGCTTCCTGGGGCCTTGCCCACTTTGCACTGTAAGTCAGAACAAGAGACCCTCCAAG
TATTTAATTTCC"

Yes, the new code was ignoring the 'end'. This is now fixed.
#### for other scaffolds I just get an error message, although thenamedscaffold definitely exists (is it doing a partial match on thename, not
an exact match?)
Yes it's doing partial matching (it uses grep()). This feature was
added a long time ago on a user's request. Note that the BSgenome
package for Cow is peculiar with its chrUn.scaffolds multiplesequence
object. Multiple sequences objects in the BSgenome packages are
generally the upstream* objects which contain long/ugly sequence
names like

> names(Btaurus$upstream1000)[1:5]
[1] "BC118489_up_1000_chr1_142608595_r chr1:142608595-142609594"
[2] "BC151644_up_1000_chr1_44930841_f chr1:44930841-44931840"
[3] "BC116005_up_1000_chr1_46052279_f chr1:46052279-46053278"
[4] "BC133294_up_1000_chr1_44020590_f chr1:44020590-44021589"
[5] "BC109604_up_1000_chr1_65024340_r chr1:65024340-65025339"

so it seemed convenient to be able to specify only part of a name,
at least when working interactively.
Currently Cow and Zebrafish are the only BSgenome packages with
multiple sequences objects other than the upstream[125]000 objects.

Thanks to your feedback, it is now clear that the "grep feature" is
not always desirable. Here are 2 workarounds:
1) Modify the regular expression so that it will match the wholething:
> getSeq(Btaurus,"^chrUn.004.1022$", as.character=FALSE)
  54139-letter "DNAString" instance
seq:AAATCTAGAGTAGAGTTTGGAATTTCAGATATACAA...GTGCTGCAATTCATGGGGTCACAAAGAGTCAGACAT
('as.character=FALSE' used here only to produce a compact output)
Note that this is not completely safe because the dots in
"chrUn.004.1022" could match any character. Using "^chrUn\\.004\\.1022$"
would be safer. But, performance aside, having to escape all special
characters in the regex just to have it behave like if fixed=TRUE
was passed to the call to grep() doesn't sound very appealing.

2) So here is another one. With BSgenome 1.18.2, the "grep feature"
will be disable when you use a high level object like GRanges or
RangedData. So you will be able to do something like:
> getSeq(Btaurus, GRanges("chrUn.004.1022", IRanges(start=1,end=10)), as.character=FALSE)
  A DNAStringSet instance of length 1
    width seq
[1]    10 AAATCTAGAG

But this workaround has its issues too (e.g. if I want to get the
entire sequence I need to know its end in advance).

OK none of those workarounds is totally satisfying so I'll try to
come up with something better for BSgenome 1.18.3 (maybe add an
'exact.match' argument to getSeq() that would be FALSE by default?)
Suggestions are welcome.

Thanks for your feedback,
H.
getSeq(Btaurus,"chrUn.004.1022")
Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i],start[i], :sequence chrUn.004.1022 found more than once, please use a non-ambiguous
name

which ( names(Btaurus$chrUn.scaffolds) == "chrUn.004.1022" )
[1] 1022

grep ( "chrUn.004.1022" , names(Btaurus$chrUn.scaffolds) )
[1] 1022 10220 10221 10222 10223 10224 10225 10226 10227 10228 10229

sessionInfo()

R version 2.12.0 (2010-10-15)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BSgenome.Btaurus.UCSC.bosTau4_1.3.16 BSgenome_1.18.1
[3] Biostrings_2.18.0 GenomicRanges_1.2.1
[5] IRanges_1.8.2

loaded via a namespace (and not attached):
[1] Biobase_2.10.0

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] getSeq and Btaurus$chrUn.scaffolds - ambiguous name error

Reply via email to