Hi,
This is easy to do with the Biostrings package (from Bioconductor).
Let's say you've managed to load the string from your
eya4_lagan_HM_cp.txt file:
my_seq <- "123456789012345678901234567890"
What you call "pairs of positions" are "ranges". Let's say you've
managed to load the ranges from your data file:
my_ranges <- rbind(c(3, 7), c(12, 13), c(18, 25))
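For the real data, one possible way to build that two-column matrix is sketched below. It assumes the positions file holds one integer per line in the order start, end, start, end, ... (which is how the quoted reply further down treats it); read_ranges is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: read a flat list of positions and pair them up.
# Assumption: one integer per line, ordered start, end, start, end, ...
read_ranges <- function(path) {
  pos <- scan(path, what = integer(), quiet = TRUE)
  stopifnot(length(pos) %% 2 == 0)      # positions must come in pairs
  matrix(pos, ncol = 2, byrow = TRUE)   # column 1 = starts, column 2 = ends
}

# Small self-contained demo on a temporary file holding the toy ranges.
tmp <- tempfile(fileext = ".txt")
writeLines(c("3", "7", "12", "13", "18", "25"), tmp)
demo_ranges <- read_ranges(tmp)
```

With the real file this would be called as read_ranges("CDS coordinates.txt"), and the result can be used in place of the hand-written my_ranges matrix.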
Then:
library(Biostrings)
my_seq <- BString(my_seq)
my_ranges <- IRanges(my_ranges[ ,1], my_ranges[ ,2])
Query 1:
> replaceAt(my_seq, at=my_ranges, value="#")
18-letter "BString" instance
seq: 12#8901#4567#67890
Query 2:
> replaceAt(my_seq, at=my_ranges, value=paste0("#", extractAt(my_seq,
my_ranges), "#"))
36-letter "BString" instance
seq: 12#34567#8901#23#4567#89012345#67890
## Or, equivalently (but more efficiently):
> replaceAt(my_seq, at=c(start(my_ranges), end(my_ranges) + 1),
value="#")
36-letter "BString" instance
seq: 12#34567#8901#23#4567#89012345#67890
You can turn the BString objects back into ordinary strings with
as.character().
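If you want to cross-check the results without Bioconductor, here is a minimal base-R sketch of the same two operations. replace_ranges is a hypothetical helper written for illustration only; it assumes the ranges are sorted and do not overlap, and it will be much slower than replaceAt() on long sequences with many ranges.

```r
# Hypothetical helper: replace each [start, end] range of the string x
# with the corresponding element of `value`.
# Assumption: ranges are sorted in increasing order and do not overlap.
replace_ranges <- function(x, start, end, value = "#") {
  stopifnot(length(start) == length(end),
            all(start <= end),
            all(start[-1] > end[-length(end)]))
  # Pieces of x that are kept: before the first range, between
  # consecutive ranges, and after the last range.
  kept <- substring(x, c(1, end + 1), c(start - 1, nchar(x)))
  value <- rep_len(value, length(start))
  # Interleave kept pieces with replacement values, then glue together.
  paste0(c(rbind(kept, c(value, ""))), collapse = "")
}

toy_seq <- "123456789012345678901234567890"
s <- c(3, 12, 18)
e <- c(7, 13, 25)

# Query 1: each range becomes a single "#".
q1 <- replace_ranges(toy_seq, s, e, "#")
# q1 is "12#8901#4567#67890"

# Query 2: each range is kept but wrapped in "#".
q2 <- replace_ranges(toy_seq, s, e, paste0("#", substring(toy_seq, s, e), "#"))
# q2 is "12#34567#8901#23#4567#89012345#67890"
```

Both results match the replaceAt() output shown above for the toy sequence.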
To install the Biostrings package:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")
Cheers,
H.
On 01/23/2014 11:04 AM, arun wrote:
Hi,
Try:
CDS1 <- read.table("CDS coordinates.txt",header=FALSE)
CDS2 <-
split(CDS1[,1],as.numeric(as.character(gl(nrow(CDS1),2,length=nrow(CDS1)))))
eya4 <- readChar("eya4_lagan_HM_cp.txt",file.info("eya4_lagan_HM_cp.txt")$size)
eyaSpl<- head(strsplit(eya4,"")[[1]],-1)
length(eyaSpl)
#[1] 311522
eyaSpl1 <- eyaSpl
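##1: replace each range with a single "#"
for (i in seq_along(CDS2)) {
  idx <- seq(CDS2[[i]][1], CDS2[[i]][2], by = 1)
  ## keep one "#" at the start of the range and mark the rest for removal
  eyaSpl1[idx] <- c("#", rep(NA, length(idx) - 1))
}
eyaSpl1New <- paste(eyaSpl1[!is.na(eyaSpl1)], collapse = "")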
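##2: insert "#" before each start and after each end
## (assumes the pairs are sorted and do not overlap)
eyaSpl2 <- rep("#", sum(length(eyaSpl), length(CDS1[,1])))
## each "#" already inserted shifts all later positions right by one
vec1 <- unlist(lapply(seq_along(CDS2), function(i) CDS2[[i]] + c(2*(i-1), 2*i)), use.names=FALSE)
eyaSpl2[-vec1] <- eyaSpl
eyaSpl2New <- paste(eyaSpl2, collapse="")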
A.K.
I have a data file here, which is imported into R by:
eya4_lagan_HM_cp <- "E:/blahblah/eya4_lagan_HM_cp.txt"
eya4_lagan_HM_cp <- readChar(eya4_lagan_HM_cp,
file.info(eya4_lagan_HM_cp)$size)
Label the first character as position 1 and the last character
as position 311,522 (the sequence contains 311,522 characters
in total). I have two closely related queries.
**Query 1)**
Now I have a data file with a list of positions here. The positions are read in
"pairs", that is, take the first pair 44184
and 44216 as an example. I wish to delete the subsequence from position
44184 (inclusive) to position 44216 (inclusive) from the previous
sequence `eya4_lagan_HM_cp` and in its place, insert the character #. In other
words, substitute the subsequence from 44184 to 44216 with #. I
would like to do this with the rest of the pairs, that is, for 151795
and 151844, I want to delete from position 151795 (inclusive) to 151844
(inclusive) in `eya4_lagan_HM_cp` and replace it with #, and so on.
**Query 2)**
Now I would like to do something slightly different with the
data file with the list of positions. Take the first pair as an example
again. I would like to insert a # right before position 44184, in other words,
insert a # between positions 44183 and 44184 in
`eya4_lagan_HM_cp` and then I would like to insert a # right after position
44216, i.e., insert a # between positions 44216 and 44217. I would like to
repeat this procedure for all position pairs. So for the next pair, I would
like a # right before 151795 and a # right after 151844.
Thank you.
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: [email protected]
Phone: (206) 667-5791
Fax: (206) 667-1319