Thank you again for your help and for giving me the opportunity to choose the most efficient method. For a small data set there is no discernible difference between the approaches. I will carry out a comparison using the large data set.
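For that comparison, a minimal timing sketch for the awk approach might look like the one below. It is only an illustration under stated assumptions: the temporary files, the row count n_rows and the synthetic ID values are made up and are not the real data, and the elapsed time is measured in whole seconds with date +%s, so it is only meaningful for inputs large enough to take more than a second.

```shell
#!/bin/bash
# Hypothetical benchmark sketch: n_rows and the temp files are illustrative only.
n_rows=200000
input_file=$(mktemp)
output_file=$(mktemp)

# Build a synthetic input with no header row; every third row has no "_".
awk -v n="$n_rows" 'BEGIN {
    for (i = 1; i <= n; i++) {
        if (i % 3 == 0) print "A" i, "B" i, "NONE"
        else            print "A" i, "B" i, "cf_" i
    }
}' > "$input_file"

# Write the header, then time only the awk split step.
echo "ID1 ID2 Y1 X1 X2" > "$output_file"
start=$(date +%s)
awk '{
    if ($3 ~ /_/) { split($3, F, "_"); print $1, $2, 1, F[1], F[2] }
    else          { print $1, $2, 0, $3, "NULL" }
}' "$input_file" >> "$output_file"
end=$(date +%s)

# Sanity check: expect the header plus one output row per input row.
out_rows=$(wc -l < "$output_file" | tr -d ' ')
echo "rows written: $out_rows, elapsed seconds: $((end - start))"
```

The same synthetic input file could then be read into R and the strsplit and tidyr variants timed with system.time(), so that all approaches are compared on identical data. The temporary files are left in place for inspection and can be removed afterwards.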
On Wed, Sep 23, 2020 at 11:52 AM LMH <lmh_users-gro...@molconn.com> wrote:
>
> Below is a script in bash that uses the awk tokenizer to do the work.
>
> This assumes that your input and output delimiter is space. The number of
> consecutive delimiters in the input is not important. This also assumes that
> the input file does not have a header row. That is easy to modify if you
> want. I always keep header rows in my data files as I think that removing
> them is asking for trouble down the road.
>
> I added a NULL for cases where there is no value for the last field. You
> could use "." if you want.
>
> You should be able to find how to run this from inside R if you want. You
> will, of course, need a bash environment to run this, so if you are not in
> linux you will need cygwin or something similar.
>
> This should be very fast, but let me know if it needs to be faster. If the
> X1_X2 variant occurs less frequently than not then we should switch the
> order in which the logic evaluates the options.
>
> LMH
>
>
> #! /bin/bash
>
> # input filename
> input_file=$1
>
> # output filename
> output_file=$2
>
> # make sure the input file exists
> if [ ! -f "$input_file" ]; then
>    echo "$input_file cannot be found"
>    exit 1
> fi
>
> # create the output file
> touch "$output_file"
>
> # make sure the output was created
> if [ ! -f "$output_file" ]; then
>    echo "$output_file was not created"
>    exit 1
> fi
>
> # write the header row
> echo "ID1 ID2 Y1 X1 X2" >> "$output_file"
>
> # character to find in the third token
> look_for='_'
>
> # process with awk
> #    if the 3rd token contains '_'
> #       split the third token on '_' into F[1] and F[2]
> #       print the first two tokens, the indicator value of 1, and the split fields F[1] and F[2]
> #    otherwise,
> #       print the first two tokens, the indicator value of 0, the 3rd token, and NULL
>
> awk -v find_char="$look_for" '{
>     if ($3 ~ find_char) {
>         split($3, F, "_")
>         print $1, $2, "1", F[1], F[2]
>     } else {
>         print $1, $2, "0", $3, "NULL"
>     }
> }' "$input_file" >> "$output_file"
>
>
> Val wrote:
> > Thank you all for the help!
> >
> > LMH, Yes I would like to see the alternative. I am using this for a
> > large data set and if the alternative is more efficient than this
> > then I would be happy.
> >
> > On Tue, Sep 22, 2020 at 6:25 PM Bert Gunter <bgunter.4...@gmail.com> wrote:
> >>
> >> To be clear, I think Rui's solution is perfectly fine and probably better
> >> than what I offer below. But just for fun, I wanted to do it without the
> >> lapply(). Here is one way. I think my comments suffice to explain.
> >>
> >> > ## which are the non "_" indices?
> >> > wh <- grep("_",F1$text, fixed = TRUE, invert = TRUE)
> >> > ## paste "_." to these
> >> > F1[wh,"text"] <- paste(F1[wh,"text"],".",sep = "_")
> >> > ## Now strsplit() and unlist() them to get a vector
> >> > z <- unlist(strsplit(F1$text, "_"))
> >> > ## now cbind() to the data frame
> >> > F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> >> > F1
> >>   ID1 ID2   text    1  2
> >> 1  A1  B1 NONE_. NONE  .
> >> 2  A1  B1  cf_12   cf 12
> >> 3  A1  B1 NONE_. NONE  .
> >> 4  A2  B2  X2_25   X2 25
> >> 5  A2  B3  fd_15   fd 15
> >> > ## You can change the names of the 2 columns yourself
> >>
> >> Cheers,
> >> Bert
> >>
> >> Bert Gunter
> >>
> >> "The trouble with having an open mind is that people keep coming along and
> >> sticking things into it."
> >> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>
> >>
> >> On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas <ruipbarra...@sapo.pt> wrote:
> >>>
> >>> Hello,
> >>>
> >>> A base R solution with strsplit, like in your code.
> >>>
> >>> F1$Y1 <- +grepl("_", F1$text)
> >>>
> >>> tmp <- strsplit(as.character(F1$text), "_")
> >>> tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
> >>> tmp <- do.call(rbind, tmp)
> >>> colnames(tmp) <- c("X1", "X2")
> >>> F1 <- cbind(F1[-3], tmp)    # remove the original column
> >>> rm(tmp)
> >>>
> >>> F1
> >>> #  ID1 ID2 Y1   X1 X2
> >>> #1  A1  B1  0 NONE  .
> >>> #2  A1  B1  1   cf 12
> >>> #3  A1  B1  0 NONE  .
> >>> #4  A2  B2  1   X2 25
> >>> #5  A2  B3  1   fd 15
> >>>
> >>>
> >>> Note that cbind dispatches on F1, an object of class "data.frame".
> >>> Therefore it's the method cbind.data.frame that is called and the result
> >>> is also a df, though tmp is a "matrix".
> >>>
> >>>
> >>> Hope this helps,
> >>>
> >>> Rui Barradas
> >>>
> >>>
> >>> At 20:07 on 22/09/20, Rui Barradas wrote:
> >>>> Hello,
> >>>>
> >>>> Something like this?
> >>>>
> >>>>
> >>>> F1$Y1 <- +grepl("_", F1$text)
> >>>> F1 <- F1[c(1, 2, 4, 3)]
> >>>> F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep = "_", fill = "right")
> >>>> F1
> >>>>
> >>>>
> >>>> Hope this helps,
> >>>>
> >>>> Rui Barradas
> >>>>
> >>>> At 19:55 on 22/09/20, Val wrote:
> >>>>> Hi All,
> >>>>>
> >>>>> I am trying to create new columns based on another column string
> >>>>> content. First I want to identify rows that contain a particular
> >>>>> string. If it contains, I want to split the string and create two
> >>>>> variables.
> >>>>>
> >>>>> Here is my sample of data.
> >>>>> F1<-read.table(text="ID1  ID2  text
> >>>>> A1 B1   NONE
> >>>>> A1 B1   cf_12
> >>>>> A1 B1   NONE
> >>>>> A2 B2   X2_25
> >>>>> A2 B3   fd_15  ",header=TRUE,stringsAsFactors=F)
> >>>>> If the variable "text" contains this "_" I want to create an indicator
> >>>>> variable as shown below
> >>>>>
> >>>>> F1$Y1 <- ifelse(grepl("_", F1$text),1,0)
> >>>>>
> >>>>>
> >>>>> Then I want to split that string into two, before "_" and after "_",
> >>>>> and create two variables as shown below
> >>>>> x1= strsplit(as.character(F1$text),'_',2)
> >>>>>
> >>>>> My problem is how to combine this with the original data frame. The
> >>>>> desired output is shown below,
> >>>>>
> >>>>>
> >>>>> ID1 ID2  Y1   X1    X2
> >>>>> A1  B1    0   NONE  .
> >>>>> A1  B1    1   cf    12
> >>>>> A1  B1    0   NONE  .
> >>>>> A2  B2    1   X2    25
> >>>>> A2  B3    1   fd    15
> >>>>>
> >>>>> Any help?
> >>>>> Thank you.
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide
> >>>>> http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.