Re: [R] spliting first 10 words in a string

David Winsemius Mon, 01 Nov 2010 15:33:25 -0700


On Nov 1, 2010, at 5:52 PM, Phil Spector wrote:

  Does this example do what you want?

mysentences = c('Here is a sentence that has a bunch of words in it',
                'Here is another sentence that also has a bunch of words',
                'I have yet another sentence and it also has a whole bunch of words')
data.frame(mysentences,do.call(rbind,lapply(strsplit(mysentences,' +'),'[',1:10)))
                                                         mysentences   X1   X2
1                 Here is a sentence that has a bunch of words in it Here   is
2            Here is another sentence that also has a bunch of words Here   is
3 I have yet another sentence and it also has a whole bunch of words    I have
      X3       X4       X5   X6  X7    X8    X9   X10
1       a sentence     that  has   a bunch    of words
2 another sentence     that also has     a bunch    of
3     yet  another sentence  and  it  also   has     a


Matevž;

Be on the alert for what the data.frame function does with character vectors. Unless you forbid it from doing so, it will convert any character vector to a factor. (A major source of confusion for R newbies.) In the above version you could prevent this in Phil's solution by:

data.frame(mysentences,do.call(rbind,lapply(strsplit(mysentences,' +'),'[',1:10)), stringsAsFactors=FALSE)
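
A minimal sketch of the difference, using Phil's example sentences. The names first10, with_factors and no_factors are made up for illustration. Both calls set stringsAsFactors explicitly, so the result should not depend on the R version (since R 4.0.0 the default is FALSE, so the conversion described above only happens on older versions or when asked for).

```r
mysentences <- c('Here is a sentence that has a bunch of words in it',
                 'Here is another sentence that also has a bunch of words',
                 'I have yet another sentence and it also has a whole bunch of words')

# First ten words of each sentence, one row per sentence (3 x 10 character matrix)
first10 <- do.call(rbind, lapply(strsplit(mysentences, ' +'), '[', 1:10))

# Explicit TRUE reproduces the old default: character columns become factors
with_factors <- data.frame(mysentences, first10, stringsAsFactors = TRUE)

# Explicit FALSE keeps every column as character
no_factors <- data.frame(mysentences, first10, stringsAsFactors = FALSE)

class(with_factors$X1)   # "factor"
class(no_factors$X1)     # "character"
```

Either version has the same words in the same cells; only the column type differs.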


Or if cbind were applied to my solution at the end of this email:

cbind(worddf, t(sapply(strsplit(worddf$words, " "), "[", 1:10) ), stringsAsFactors=FALSE)

> str( cbind(worddf, t(sapply(strsplit(worddf$words, " "), "[", 1:10) ), stringsAsFactors=FALSE) )
'data.frame':   3 obs. of  11 variables:
 $ words: chr  "I have a columnn with text that has quite a few words in it." "I would like to split these words in separate columns" "but just first ten words in the string. Is that possible in R?"
 $ 1    : chr  "I" "I" "but"
 $ 2    : chr  "have" "would" "just"
 $ 3    : chr  "a" "like" "first"
 $ 4    : chr  "columnn" "to" "ten"
 $ 5    : chr  "with" "split" "words"
 $ 6    : chr  "text" "these" "in"
 $ 7    : chr  "that" "words" "the"
 $ 8    : chr  "has" "in" "string."
 $ 9    : chr  "quite" "separate" "Is"
 $ 10   : chr  "a" "columns" "that"

cbind.data.frame is a method that would be invoked for that operation. This result has the disadvantage that the column names will need to be enclosed in quotes to access them with the "$" function since they start with numerals.
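
One way around the quoting would be to give the new columns syntactic names before binding. This is only a sketch: the "word" prefix and the names first10 and res are my own choice for illustration, not something from the thread.

```r
words <- c("I have a columnn with text that has quite a few words in it.",
           "I would like to split these words in separate columns",
           "but just first ten words in the string. Is that possible in R?")
worddf <- data.frame(words = words, stringsAsFactors = FALSE)

# 3 x 10 character matrix holding the first ten words of each row
first10 <- t(sapply(strsplit(worddf$words, " "), "[", 1:10))

# Name the columns word1 ... word10 instead of 1 ... 10
colnames(first10) <- paste0("word", 1:10)

res <- cbind(worddf, first10, stringsAsFactors = FALSE)
res$word4    # plain $ access, no quotes or backticks needed
```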


(Or you could just deal with the factor type.)

--
David.


                                        - Phil Spector
                                         Statistical Computing Facility
                                         Department of Statistics
                                         UC Berkeley
                                         spec...@stat.berkeley.edu


On Mon, 1 Nov 2010, Matevž Pavlič wrote:

...I would like, e.g., to split this sentence from the field Opis in a data.frame:
Opis : "I have a sentense with ten words", so that it would convert to something like this:
Opis : "I have a sentense with ten words"; Column1 : "I"; Column2 : "have"; Column3 : "a"; Column4 : "sentense"; Column5 : "with"; Column6 : "ten"; Column7 : "words"
....or in a data.frame something like this (as I understand it):

'data.frame':   xx obs. of  12 variables:
$ Opis    : factor "I have a sentense with ten words";
$ Column1 : factor "I";
$ Column2 : factor "have";
$ Column3 : factor "a";
$ Column4 : factor "sentense";
$ Column5 : factor "with";
$ Column6 : factor "ten";
$ Column7 : factor "words"
Hope that explains it better. I am still having some trouble understanding R and all...
m


-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Matevž Pavlič
Sent: Monday, November 01, 2010 10:34 PM
To: David Winsemius
Cc: r-help@r-project.org
Subject: Re: [R] spliting first 10 words in a string

Hi,

I am sorry, will try to be more exact from now on...
I have a data.frame with a field called Opis. It contains sentences that I would like to split into words, or fields in the data.frame... when I say columns I mean as in an Excel table. I would like to split "Opis" into ten fields holding the first ten words of the Opis field.
Here is an example of my data.frame.

'data.frame':   22928 obs. of  12 variables:
$ VrtinaID        : int  1 1 1 1 2 2 2 2 2 2 ...
$ ZapStev         : int  1 2 3 4 1 2 3 4 5 6 ...
$ GlobinaOd       : num  0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
$ GlobinaDo       : num  0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
$ Opis            : Factor w/ 12754 levels "","(MIVKA) DROBENMELJAST PESEK, GOST, SIVORJAV",..: 2060 11588 2477 11660 7539 3182 7884 9123 2500 4756 ...
$ ACklasifikacija : Factor w/ 290 levels "","(CL)","(CL)/(SC)",..: 154 125 101 101 NA 106 125 80 106 101 ...
$ GeolNastOd      : num  0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
$ GeolNastDo      : num  0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
$ GeolNastOpis    : Factor w/ 113 levels "","B. M. S.",..: 56 53 53 53 56 53 53 53 53 53 ...
$ NacinVrtanjaOd  : num  0e+00 1e+09 1e+09 1e+09 0e+00 ...
$ NacinVrtanjaDo  : num  1.1e+01 1.0e+09 1.0e+09 1.0e+09 1.0e+01 ...
$ NacinVrtanjaOpis: Factor w/ 43 levels "","H. N.","IZKOP",..: 26 11 1 26 1 1 1 1 1 ...
Hope that explains better...
Thank you, m
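
Since Opis is stored as a factor in the str() output above, it would need converting to character before strsplit() will accept it. A rough sketch on a tiny stand-in data frame: dat and the second Opis value are invented for illustration, the first Opis value is the one level visible in the str() output, and the Column1 ... Column10 names follow the wording of the request.

```r
# Hypothetical stand-in for the real 22928-row data frame
dat <- data.frame(VrtinaID = c(1L, 1L, 2L),
                  Opis = factor(c("(MIVKA) DROBENMELJAST PESEK, GOST, SIVORJAV",
                                  "an invented second description",
                                  "")))

# strsplit() rejects factors, so convert to character first
opis_words <- strsplit(as.character(dat$Opis), " +")

# Indexing with 1:10 pads short (or empty) entries with NA
first10 <- t(sapply(opis_words, "[", 1:10))
colnames(first10) <- paste0("Column", 1:10)

dat <- cbind(dat, first10, stringsAsFactors = FALSE)
str(dat)
```

Rows with fewer than ten words simply get NA in the remaining columns, and the original Opis column is left untouched.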

-----Original Message-----
From: David Winsemius [mailto:dwinsem...@comcast.net]
Sent: Monday, November 01, 2010 10:13 PM
To: Matevž Pavlič
Cc: r-help@r-project.org
Subject: Re: [R] spliting first 10 words in a string


On Nov 1, 2010, at 4:39 PM, Matevž Pavlič wrote:
Hi all,



I have a columnn with text that has quite a few words in it. I would
like to split these words in separate columns, but just first ten
words in the string. Is that possible in R?

Not sure what a column means to you. It's not a precisely defined R
type or class. (And you are requested to offer a concrete example
rather than making us guess.)

>words <-"I have a columnn with text that has quite a few words in
it. I would like to split these words in separate columns, but just
first ten words in the string. Is that possible in R?"

> strsplit(words, " ")[[1]][1:10]
[1] "I"       "have"    "a"       "columnn" "with"    "text"
"that"    "has"     "quite"   "a"
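
If the string contains line breaks or repeated spaces (as it would if the wrapped text above were pasted as is), splitting on a single space can leave fused or empty pieces. A sketch that splits on runs of whitespace instead; first_n_words is a hypothetical helper written for this example, not a base R function, and the shortened test string is my own.

```r
# Hypothetical helper: first n whitespace-separated words of one string
first_n_words <- function(x, n = 10) {
  pieces <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  head(pieces, n)          # returns fewer than n words if the string is short
}

# Deliberately contains a line break and a double space
words <- "I have a columnn with text that has quite a few words in
it. I would like to split  these words"

first_n_words(words)       # the first ten words
first_n_words(words, 14)   # words 13 and 14 straddle the line break
```

Unlike indexing with [1:10], head() does not pad with NA, which may or may not be what you want when the pieces go into a data frame.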


Or if in a dataframe:

> words <- c("I have a columnn with text that has quite a few words in it.",
+            "I would like to split these words in separate columns",
+            "but just first ten words in the string. Is that possible in R?")
> worddf <- data.frame(words=words, stringsAsFactors=FALSE)

> t(sapply(strsplit(worddf$words, " "), "[", 1:10) )
     [,1]  [,2]    [,3]    [,4]      [,5]    [,6]    [,7]    [,8]      [,9]       [,10]
[1,] "I"   "have"  "a"     "columnn" "with"  "text"  "that"  "has"     "quite"    "a"
[2,] "I"   "would" "like"  "to"      "split" "these" "words" "in"      "separate" "columns"
[3,] "but" "just"  "first" "ten"     "words" "in"    "the"   "string." "Is"       "that"


--
David Winsemius, MD
West Hartford, CT


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
