I do not have your command of base R, Bert. That is a herculean effort! Here's what I spent my night putting together:
## Packages as before: tidyverse, purrr, tidytext

## Create search terms ##
## dput(st) gives:
st <- structure(list(word1 = c("technique", "me", "me", "feel", "feel"),
                     word2 = c("olympic", "abused", "hurt", "hopeless", "alone"),
                     word3 = c("lifts", "depressed", "depressed", "depressed", "depressed")),
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))

## Create tweets ##
## dput(th) gives:
th <- structure(list(status_id = c("x1047841705729306624", "x1046966595610927105",
    "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
    "x1047227442899775488", "x1048126008941981696", "x1047798782673543173",
    "x1048269727582355457", "x1048092408544677890"),
  created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
    "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z", "2018-10-02T01:26:49Z",
    "2018-10-02T20:50:53Z", "2018-10-05T08:21:28Z", "2018-10-04T10:41:11Z",
    "2018-10-05T17:52:33Z", "2018-10-05T06:07:57Z"),
  text = c("technique is everything with olympic lifts ! @ body by john ",
    "@subtronics just went back and rewatched ur fblice with ur cdjs and let me tell you man. you are the fucking messiah",
    "@ic4rus1 opportunistic means short-game. as in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
    "i tend to think about my dreams before i sleep.",
    "@michaelavenatti @senatorcollins so if your client was in her 20s attending parties with teenagers doesnt that make her at the least immature as hell or at the worst a pedophile and a person contributing to the delinquency of minors?",
    "i wish i could take credit for this",
    "i woulda never imagined. #lakeshow ",
    "@philipbloom @blackmagic_news its ok phil! i feel your pain! ",
    "sunday ill have a booth in katy at the real craft wives of katy fest @nolabelbrewco cmon yall!everything is better when you top it with tias!order today we ship to all 50 ",
    "dolly is so baddd"),
  lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935,
    33.9376735, 34.0207895, 44.900818, 29.7926, 32.364145),
  lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885,
    -118.130426, -118.4119065, -89.5694915, -95.8224, -86.2447285),
  county_name = c("Cumberland County", "Delaware County", "San Francisco County",
    "Allegheny County", "Concho County", "Los Angeles County", "Los Angeles County",
    "Marathon County", "Harris County", "Montgomery County"),
  fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L, 6037L, 55073L, 48201L, 1101L),
  state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas",
    "California", "California", "Wisconsin", "Texas", "Alabama"),
  state_abb = c("ME", "OH", "CA", "PA", "TX", "CA", "CA", "WI", "TX", "AL"),
  urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
    "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro",
    "Large Central Metro", "Small Metro", "Large Central Metro", "Medium Metro"),
  urban_code = c(3L, 2L, 1L, 1L, 6L, 1L, 1L, 4L, 1L, 3L),
  population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L,
    9509611L, 127612L, 4233913L, 211037L),
  linenumber = 1:10),
  row.names = c(NA, 10L), class = "data.frame")

## Clean tweets: strip everything we don't need from the text,
## including punctuation and urls
th %>%
  mutate(linenumber = row_number(),
         text = str_remove_all(text, "[^\x01-\x7F]"),
         text = str_remove_all(text, "\n"),
         text = str_remove_all(text, ","),
         text = str_remove_all(text, "'"),
         text = str_remove_all(text, "&"),
         text = str_remove_all(text, "<"),
         text = str_remove_all(text, ">"),
         text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
         text = tolower(text)) -> th
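(An aside: I suspect the chain of str_remove_all() calls could be collapsed into two patterns. This is untested, and the combined character class is my own consolidation, but something like:)

## Untested sketch: same cleanup in two passes, removing URLs first so
## their punctuation never reaches the second pattern
th %>%
  mutate(linenumber = row_number(),
         text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
         text = str_remove_all(text, "[^\x01-\x7F]|[\n,'&<>]"),
         text = tolower(text)) -> th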
## Create search function: looks for each search word in the provided
## string, checks whether all three have been found, and returns a logical
srchr <- function(txt) {
  a <- str_detect(txt, "olympic")
  b <- str_detect(txt, "technique")
  c <- str_detect(txt, "lifts")
  a & b & c  ## already a logical; no ifelse() needed
}

## Evaluate tweets for presence of the search words
## (map_lgl rather than map_chr, since srchr returns a logical)
th %>% mutate(flag = map_lgl(text, srchr)) -> th_flagged

As far as I can tell, this works, but I have to manually enter each set of search words into the function, which is not ideal. It also only evaluates each tweet against a single search term, so I would end up with one TRUE/FALSE column per search term that I would then have to collapse together somehow. I'm sure there's a more elegant solution; the sketch below is the direction I've been poking at.
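This is untested, and match_term is just a name I made up (the "\\b" word boundaries are an attempt to keep "me" from matching inside "messiah"), but something like it might run each search term over every tweet and OR the results together:

## Untested sketch: flag tweets containing all three words of ANY term
match_term <- function(w1, w2, w3, txt) {
  str_detect(txt, paste0("\\b", w1, "\\b")) &
    str_detect(txt, paste0("\\b", w2, "\\b")) &
    str_detect(txt, paste0("\\b", w3, "\\b"))
}

## pmap() walks the rows of st, passing its word1/word2/word3 columns
hits <- pmap(st, function(word1, word2, word3)
  match_term(word1, word2, word3, th$text))

## collapse the per-term logical vectors into a single flag column
th$flag <- reduce(hits, `|`)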
--
Nate Parsons
Pronouns: He, Him, His
Graduate Teaching Assistant
Department of Sociology
Portland State University
Portland, Oregon
503-725-9025
503-725-3957 FAX

On Oct 16, 2018, 7:20 PM -0700, Bert Gunter <bgunter.4...@gmail.com>, wrote:
> OK, as no one else has offered a solution, I'll take a whack at it.
>
> Caveats: This is a brute force attempt using R's basic regular expression
> engine. It is inelegant and barely tested, so likely to be at best
> incomplete and buggy, and at worst, incorrect. But maybe Nathan or someone
> else on the list can fix it up. So if (when) it breaks, complain on the
> list to give someone (almost certainly not me) the opportunity.
>
> The basic idea is that the tweets are just character strings and the
> search phrases are just character vectors, all of whose elements must
> match "appropriately" -- i.e. they must match whole words -- in the
> character strings. So my desired output from the code is a list indexed
> by the search phrases, each of whose components is a logical vector whose
> length is the number of tweets and whose elements are TRUE iff all the
> words in the search phrase match somewhere in the tweet.
>
> Here's the code (using the data Nathan provided):
>
> > words <- sapply(st[[1]], strsplit, split = " +")
> ## convert the phrases to a list of character vectors of the words
> ## Result:
> > words
> $`me abused depressed`
> [1] "me"        "abused"    "depressed"
>
> $`me hurt depressed`
> [1] "me"        "hurt"      "depressed"
>
> $`feel hopeless depressed`
> [1] "feel"      "hopeless"  "depressed"
>
> $`feel alone depressed`
> [1] "feel"      "alone"     "depressed"
>
> $`i feel helpless`
> [1] "i"        "feel"     "helpless"
>
> $`i feel worthless`
> [1] "i"         "feel"      "worthless"
>
> > expand.words <- function(z)
> +   lapply(z, function(x) paste0(c("^ *", " ", " "), x, c(" ", " ", " *$")))
> ## function to create regexes for words when they are at the beginning,
> ## middle, or end of tweets
>
> > wordregex <- lapply(words, expand.words)
> ## Result: too lengthy to include
>
> > tweets <- th$text
> ## extract the tweets
>
> ## findin: x is a vector of regex patterns, y is a character vector;
> ## value = vector, vec, with length(vec) == length(y) and vec[i] == TRUE
> ## iff any of x matches y[i]
> > findin <- function(x, y) {
> +   apply(sapply(x, function(z) grepl(z, y)), 1, any)
> + }
>
> ## add a matching "tweet" to the tweet vector:
> > tweets <- c(tweets, " i xxxx worthless yxxc ght feel")
>
> > ans <- lapply(wordregex,
> +   function(z) apply(sapply(z, function(x) findin(x, tweets)), 1, all))
> ## Result:
> > ans
> $`me abused depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`me hurt depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel hopeless depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel alone depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel helpless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel worthless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>
> ## None of the tweets match any of the phrases except for the last tweet
> ## that I added.
>
> ## Note: you need to add capabilities to handle upper and lower case.
> ## See, e.g., ?casefold
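> For instance, something like this (untested; an addition beyond what I
> actually ran) would fold everything to lower case before matching:
>
> > tweets <- casefold(tweets, upper = FALSE)
>
> Passing ignore.case = TRUE to grepl() inside findin() would do the same
> job pattern-by-pattern.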
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
> On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <bgunter.4...@gmail.com> wrote:
> > The problem wasn't the data tibbles. You posted in html -- which you
> > were explicitly warned against -- and that corrupted your text (e.g.
> > some quotes became "smart quotes", which cannot be properly cut and
> > pasted into R).
> >
> > Bert
> >
> > On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons
> > <nathan.f.pars...@gmail.com> wrote:
> > > Argh! Here are those two example datasets as data frames (not
> > > tibbles). Sorry again. This apparently is just not my day.
> > >
> > > th <- structure(list(status_id = c("x1047841705729306624",
> > >     "x1046966595610927105", "x1047094786610552832", "x1046988542818308097",
> > >     "x1046934493553221632", "x1047227442899775488"),
> > >   created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
> > >     "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z", "2018-10-02T01:26:49Z",
> > >     "2018-10-02T20:50:53Z"),
> > >   text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
> > >     "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
> > >     "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
> > >     "I tend to think about my dreams before I sleep.",
> > >     "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
> > >     "i wish i could take credit for this"),
> > >   lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935, 33.9376735),
> > >   lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426),
> > >   county_name = c("Cumberland County", "Delaware County", "San Francisco County",
> > >     "Allegheny County", "Concho County", "Los Angeles County"),
> > >   fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
> > >   state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas", "California"),
> > >   state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
> > >   urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
> > >     "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
> > >   urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
> > >   population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)),
> > >   class = "data.frame", row.names = c(NA, -6L))
> > >
> > > st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
> > >     "feel hopeless depressed", "feel alone depressed", "i feel helpless",
> > >     "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
> > >     "tbl", "data.frame"))
> > >
> > > On Tue, Oct 16, 2018 at 2:39 PM Nathan Parsons
> > > <nathan.f.pars...@gmail.com> wrote:
> > > > Thanks all for your patience. Here's a second go that is perhaps
> > > > more explicative of what it is I am trying to accomplish (and
> > > > hopefully in plain text form)...
> > > >
> > > > I'm using the following packages: tidyverse, purrr, tidytext
> > > >
> > > > I have a number of tweets in the following form:
> > > > th <- structure(list(status_id = c("x1047841705729306624",
> > > >     "x1046966595610927105", "x1047094786610552832", "x1046988542818308097",
> > > >     "x1046934493553221632", "x1047227442899775488"),
> > > >   created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
> > > >     "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z", "2018-10-02T01:26:49Z",
> > > >     "2018-10-02T20:50:53Z"),
> > > >   text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
> > > >     "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
> > > >     "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
> > > >     "I tend to think about my dreams before I sleep.",
> > > >     "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
> > > >     "i wish i could take credit for this"),
> > > >   lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935, 33.9376735),
> > > >   lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426),
> > > >   county_name = c("Cumberland County", "Delaware County", "San Francisco County",
> > > >     "Allegheny County", "Concho County", "Los Angeles County"),
> > > >   fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
> > > >   state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas", "California"),
> > > >   state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
> > > >   urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
> > > >     "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
> > > >   urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
> > > >   population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)),
> > > >   class = c("data.table", "data.frame"), row.names = c(NA, -6L),
> > > >   .internal.selfref = )
> > > >
> > > > I also have a number of search terms in the following form:
> > > >
> > > > st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
> > > >     "feel hopeless depressed", "feel alone depressed", "i feel helpless",
> > > >     "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
> > > >     "tbl", "data.frame"))
> > > >
> > > > I am trying to isolate the tweets that contain all of the words in
> > > > each of the search terms, i.e. "me", "abused", and "depressed" from
> > > > the first example search term, but they do not have to be in order
> > > > or even next to one another.
> > > >
> > > > I am familiar with the dplyr suite of tools and have been attempting
> > > > to generate some sort of filter() to do this. I am not very familiar
> > > > with purrr, but there may be a solution using the map function? I
> > > > have also explored the tidytext unnest_tokens() function, which
> > > > transforms the th data in the following way:
> > > >
> > > > > tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
> > > > > head(tt)
> > > >               status_id           created_at      lat       lng
> > > > 1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > 2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > 3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > 4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > 5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > 6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > >          county_name  fips state_name state_abb  urban_level urban_code
> > > > 1: Cumberland County 23005      Maine        ME Medium Metro          3
> > > > 2: Cumberland County 23005      Maine        ME Medium Metro          3
> > > > 3: Cumberland County 23005      Maine        ME Medium Metro          3
> > > > 4: Cumberland County 23005      Maine        ME Medium Metro          3
> > > > 5: Cumberland County 23005      Maine        ME Medium Metro          3
> > > > 6: Cumberland County 23005      Maine        ME Medium Metro          3
> > > >    population       word
> > > > 1:     277308  technique
> > > > 2:     277308         is
> > > > 3:     277308 everything
> > > > 4:     277308       with
> > > > 5:     277308    olympic
> > > > 6:     277308      lifts
> > > >
> > > > but once I have unnested the tokens, I am unable to recombine them
> > > > back into tweets.
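> > > > For a single search term, I imagine something along these lines (an
> > > > untested sketch; first_term is a made-up stand-in for one term split
> > > > into its words, not an object I actually have):
> > > >
> > > > first_term <- c("me", "abused", "depressed")
> > > > tt %>%
> > > >   group_by(status_id) %>%
> > > >   summarize(flag = all(first_term %in% word))
> > > > ## flag is TRUE only when every word of the term appears in the tweet
> > > >
> > > > but I have not gotten that to generalize across all the search terms.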
> > > > Ideally the end result would append a new column to the th data
> > > > that would flag a tweet that contained all of the search words for
> > > > any of the search terms, so the workflow would look like:
> > > >
> > > > 1) look for all search words for one search term in a tweet
> > > > 2) if all of the search words in the search term are found, create
> > > >    a flag (mutate(flag = 1) or some such)
> > > > 3) do this for all of the tweets
> > > > 4) move on to the next search term and repeat
> > > >
> > > > Again, my thanks for your patience.
> > > >
> > > > --
> > > > Nate Parsons
> > > > Pronouns: He, Him, His
> > > > Graduate Teaching Assistant
> > > > Department of Sociology
> > > > Portland State University
> > > > Portland, Oregon
> > > > 503-725-9025
> > > > 503-725-3957 FAX