Hi,

Maybe this helps:

library(gsubfn)  # provides strapply()
lines1 <- readLines(textConnection('text to be ignored...
CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
other text to be ignored...
CDS complement(3300..4037)
/gene="REV7"
other text to be ignored...
CDS <4500..4550
/gene="REV7"
other text to be ignored...
CDS complement(join(30708..31700,31931..31984))
/gene="REV7"'))
lines2 <- lines1[grep("CDS",lines1)]       # keep only the CDS lines
lines3 <- lines2[!grepl("[<>]",lines2)]    # drop partial locations marked with < or >
indx <- grepl("complement",lines3)*1       # 1 if "complement" is present, else 0
# SIMPLIFY=FALSE keeps a list even when every line yields the same number of values
mapply(`c`,indx,strapply(lines3,"([0-9]+)",as.numeric),SIMPLIFY=FALSE)
#[[1]]
#[1]    0  687 3158
#
#[[2]]
#[1]    1 3300 4037
#
#[[3]]
#[1]     1 30708 31700 31931 31984

If you want to have "," as sep:

lapply(mapply(`c`,indx,strapply(lines3,"([0-9]+)",as.numeric),SIMPLIFY=FALSE),paste,collapse=", ")

A.K.

For sure, maybe I could provide a more realistic sample of what I have rather than the vector. Here is a chunk of the text I'll be processing:

text to be ignored...
CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
other text to be ignored...
CDS complement(3300..4037)
/gene="REV7"
other text to be ignored...
CDS <4500..4550
/gene="REV7"
other text to be ignored...
CDS complement(join(30708..31700,31931..31984))
/gene="REV7"
and so on ...

Processing this text, I want the following output in a list called (let's say) output, with as many elements as there are valid "CDS" entries (i.e. CDS without "<" or ">"), where the first component of each element is a 0/1 number that tells whether what followed "CDS" included the word "complement". Here is what I would like to get for the above text:

output:
[[1]] 0, 687, 3158
[[2]] 1, 3300, 4037
[[3]] 1, 30708, 31700, 31931, 31984

Thanks again for the help!

Thank you very much for the response! This is a major improvement on what I was getting! I need to read and understand what is done, as I need to modify it a little bit.
The exact requirement for me is not only to recognize the numbers that follow "CDS" but also to differentiate between the 4 accepted cases:

"CDS 3300..4037" or
"CDS complement(3300..4037)" or
"CDS join(21467..26641,27577..28890)" or
"CDS complement(join(30708..31700,31931..31984))"

I need to do different things for each. For example, when "join" follows the gap, I need to join the ranges (e.g. in this case have two intervals [21467 26641] U [27577 28890]) in one set.

Many thanks though for getting me going!

On Thursday, February 6, 2014 2:20 PM, arun <smartpink...@yahoo.com> wrote:

You could also try:

library(gsubfn)
strapply(gsub("\\d+<|>\\d+","",vec1),"([0-9]+)",as.numeric,simplify=c)

A.K.

On Thursday, February 6, 2014 1:55 PM, arun <smartpink...@yahoo.com> wrote:

Hi,

One way would be:

vec1 <- c("CDS 3300..4037",
          "CDS complement(3300..4037)",
          "CDS 3300<..4037",
          "CDS join(21467..26641,27577..28890)",
          "CDS complement(join(30708..31700,31931..31984))",
          "CDS 3300<..>4037")
library(stringr)
as.numeric(unlist(strsplit(str_trim(gsub("\\D+"," ",gsub("\\d+<|>\\d+","",vec1)))," ")))
# [1]  3300  4037  3300  4037  4037 21467 26641 27577 28890 30708 31700 31931
#[13] 31984

A.K.

Hi,

I have been using R for the past 1.5 years and have usually found topics relatively easy to learn on my own, but I am finding the learning curve with regular expressions a little steep, especially since I haven't found any good tutorials. While I intend to spend more time systematically learning the proper way to write regular expressions, I have a project that is coming due and can't wait for that, so I was hoping to get some direct help.

I need to extract all the numbers in lines with the following formats:

"CDS 3300..4037" or
"CDS complement(3300..4037)" or
"CDS join(21467..26641,27577..28890)" or
"CDS complement(join(30708..31700,31931..31984))"

but not if any of the numbers are preceded by "<" or followed by ">".

Many thanks in advance!
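
For the requirement of telling the four accepted CDS forms apart and keeping joined ranges together as one set, the following base R sketch may be a useful starting point. It is not from the thread: parse_cds() and the field names complement, join and ranges are hypothetical, and it assumes each CDS location sits on a single line and that any line containing "<" or ">" should be skipped entirely.

```r
# Hypothetical helper (not from the thread): parse one CDS location line.
# Returns NULL for non-CDS lines and for partial locations ("<" or ">").
parse_cds <- function(line) {
  if (!grepl("CDS", line, fixed = TRUE) || grepl("[<>]", line)) return(NULL)
  # Keep only what follows "CDS" so digits earlier on the line are ignored.
  loc <- sub("^.*CDS\\s+", "", line)
  # Each "start..end" pair becomes one row of a two-column matrix.
  pairs <- regmatches(loc, gregexpr("[0-9]+\\.\\.[0-9]+", loc))[[1]]
  if (length(pairs) == 0) return(NULL)
  bounds <- as.numeric(unlist(strsplit(pairs, "..", fixed = TRUE)))
  list(
    complement = grepl("complement(", loc, fixed = TRUE),
    join       = grepl("join(", loc, fixed = TRUE),
    ranges     = matrix(bounds, ncol = 2, byrow = TRUE,
                        dimnames = list(NULL, c("start", "end")))
  )
}

vec1 <- c("CDS 3300..4037",
          "CDS complement(3300..4037)",
          "CDS 3300<..4037",
          "CDS join(21467..26641,27577..28890)",
          "CDS complement(join(30708..31700,31931..31984))",
          "CDS 3300<..>4037")

# Drop the NULLs produced by the partial locations.
parsed <- Filter(Negate(is.null), lapply(vec1, parse_cds))

# Same shape as the list asked for in the thread:
# a 0/1 complement flag followed by the interval bounds.
output <- lapply(parsed, function(p) c(as.numeric(p$complement), t(p$ranges)))
```

Each element of parsed keeps the complement and join flags separate from the intervals, so the four cases can be handled differently, and the rows of ranges hold the intervals that belong to one joined set.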
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.