Re: [R] Discovering patterns in textual strings

Jeff Reichman Mon, 07 May 2018 14:03:31 -0700

Bert

Here are some examples of the type of text strings I’m dealing with:

??????.??.???

??????.??.??????????

?Torrent? Pro - Torrent App

?Torrent?-Torrent Downloader

1 Pic 8 Words - Syllables

1 Pic 8 Words - Syllables

27043_Spanish songs for children

28.android.com.alpha.horoscope

28.android.com.bravo.horoscope

28.Card Game - Offline

28.card Game Multiplayer

37045_Spanish songs for children

7 Minute Workout for Weight Loss: Daily Cardio App

7 Minute Workout Plus

7 Minute Workout_SMA_IA_$2.25_com.popularapp.sevenmins_CD_Android_MEDIUMRECTANGLE_300x250_IAB7

7 Nights at Pizza House - 2

7 Nights at Pizza House 3D

com.zombodroid

com.zombodroid.battle

com.zombodroid.memegenerator

com.zone.talking.pet

com.zone.yinshidaquan

Disney Kingdom

Disney Kingdom_Android

Evite

Evite Invitations

Evite IOS_Evite_IOS_320x50

Excavator Simulator 3D:Sand

Excavator Snow Plow Loader Truck

Flippy Knife

Flippy Knife - 654567

fliptech.iowafmworld

fliptech.serbiafmworld

Floor is lava!

Floor is lava: Escape

Go_Launcher

Go_Launcher_Lite

myyearbook Android

myyearbook.com-MeetMe_Android_300x250_UK

hoping to obtain something like ….

??????.??

Torrent

1 Pic 8 Words

7 Minute Workout

7 Nights at Pizza House

com.zombodroid

com.zone

Disney Kingdom

Flippy Knife

fliptech

Floor is lava

Go_Launcher

myyearbook 
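
One possible reading of the desired output above is "the longest prefix shared by neighbouring entries in the sorted list". The following is only a minimal sketch under that assumption; the helper name common_prefix, the 4-character cut-off and the set of trailing separators to strip are illustrative choices, not anything specified in this thread.

```r
# A few of the AdNames listed above, already in sorted order
x <- c("Disney Kingdom", "Disney Kingdom_Android",
       "Flippy Knife", "Flippy Knife - 654567",
       "Go_Launcher", "Go_Launcher_Lite")

# Longest common prefix of two strings, compared character by character
# (hypothetical helper, not from any package)
common_prefix <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb))
  if (n == 0) return("")
  same <- ca[seq_len(n)] == cb[seq_len(n)]
  k <- if (all(same)) n else which(!same)[1] - 1
  substr(a, 1, k)
}

# Compare each string with its successor
prefixes <- mapply(common_prefix, head(x, -1), tail(x, -1), USE.NAMES = FALSE)

# Strip trailing separators and drop very short prefixes
# (the threshold of 4 characters is an arbitrary assumption)
prefixes <- sub("[ ._-]+$", "", prefixes)
keys <- unique(prefixes[nchar(prefixes) >= 4])
keys
```

Note that a raw common prefix can end in the middle of a word (for example "Excavator S" for the two Excavator entries), so some trimming back to a word or separator boundary would probably still be needed on real data.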

From: Bert Gunter <[email protected]> 
Sent: Saturday, May 5, 2018 2:14 AM
To: [email protected]
Cc: R-help <[email protected]>
Subject: Re: [R] Discovering patterns in textual strings

I am still somewhat confused by your specifications, but others may not be. 
Part of my confusion stems from your failure to provide a reproducible example 
(see e.g. the posting guide linked below).  For example, I cannot tell from 
your text whether the Abc and Bce strings contain one or more spaces at the 
end. I shall assume they may but need not.

Anyway, here is a reproducible example and solution that assumes that the 
substrings/patterns of interest to you occur at the beginning of the strings 
and may or may not be followed by one of "." "_" or " "(space) and then 
possibly further text which should be ignored. Assuming that you are familiar 
with regular expressions, maybe this will help to get you started even if I 
have misunderstood your specifications. If you aren't familiar with regex's, 
maybe the stringr package may provide a gentler interface than using R's raw 
regex functionality. Or maybe someone else can suggest a better approach (which 
is another reason why you should reply to the list, not just me).

z <- c("abc",
       "abc_def",
       "abc.def",
       "abc def",
       "abcd_ef",
       "abcd",
       "e","f")

pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))

## gives:
> pats
[1] "abc"  "abcd" "e"    "f"  

This gives you the four separate patterns that you could then use to group your 
records, perhaps by:

> lapply(pats,function(x)grep(paste0("^", x,"([_. ]|$)"), z))
[[1]]
[1] 1 2 3 4

[[2]]
[1] 5 6

[[3]]
[1] 7

[[4]]
[1] 8 

That is, indices 1-4 in z are the first group; 5 and 6 are the second; etc.
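
For a large number of unique strings, the same grouping can perhaps be obtained in one vectorised pass instead of one grep() call per pattern. The sketch below is an alternative, not the code above: it assumes the key is everything before the first ".", "_" or space, and the names key and groups are illustrative.

```r
# The example vector from above
z <- c("abc", "abc_def", "abc.def", "abc def", "abcd_ef", "abcd", "e", "f")

# Key = everything before the first ".", "_" or space
# (the whole string if it contains no separator)
key <- sub("[. _].*$", "", z)

# Indices of z, grouped by key
groups <- split(seq_along(z), key)
groups
```

One difference worth noting: with several separators, as in "Abc_1234_kjhksh_276", the greedy (.+) in the pattern above keeps everything up to the last separator ("Abc_1234_kjhksh"), whereas this version cuts at the first one ("Abc"). It also treats a space as a separator, so names such as "7 Minute Workout Plus" would be cut after "7".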

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, May 4, 2018 at 9:00 PM, Jeff Reichman <[email protected]> wrote:

Bert

Thank you for the  link.  Figured there might be something

Regarding your questions

This is from a large data set of 53 billion records.  The column in question 
contains AdNames (Real Time Bidding data)

#1. Generally yes, but not always

#2 Separators could be underscores  (_) or dots (.) as in 1.2.3_ABC .....

#3 Yes. So "Abc 123" could be a matching string

This would not be considered a match  ...
abc_something
this.is_a long stringwithabcinthemiddle

The sequence(s) are always at the beginning (or so it appears).  Out of the 
54 billion records  I am able to pull (SparkR sql) 948,679 unique strings.  It 
is from these unique strings that I (if possible)  want to identify the "key" 
strings.

1.  Abc_1232.niok7j9hd
2.  Abc
3.  Abc.2#348hfk2.njilo
4.  Abc.2
5.  Abc.7
6.  BAdfr_kajdhf98#kjsdh
7.  BAdrf_gofer
948679 ....

So I may have a thousand individual strings, all of which have Abc as a common 
string, or BAdrf.  So I am looking to pull "Abc", "BAdrf", etc.  Then I can 
go back and restructure the data to show that any record with 
Abc_1232.niok7j9hd is part of the Abc "Group," or Family ???

Does that help?

Jeff

-----Original Message-----
From: Bert Gunter <[email protected]>
Sent: Friday, May 4, 2018 5:41 PM
To: [email protected]
Cc: R-help <[email protected]>
Subject: Re: [R] Discovering patterns in textual strings

The answer is, of course, using regular expressions and/or libraries therefor. 
However, I do not think you have defined your problem sufficiently. Some 
questions I have:

1. Do possible patterns to be matched always appear at the beginning of your 
strings?

2. Always together between specified separators ("_"  in your example); or one 
of several specified separators; or otherwise?

3. Do spaces or other nonprinting characters occur in your strings?

e.g. would

abc_something
this.is_a long stringwithabcinthemiddle

be considered matching?
There are undoubtedly other possibilities that I've missed.

You may also find it useful to check this "task view" out for possibilities:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, May 4, 2018 at 3:25 PM, Jeff Reichman <[email protected]> wrote:
> R Help Forum
>
>
>
> Is there an R library (or some other way) to extract unique character
> strings, or repeating patterns, from textual strings?  Say for example I
> have the following records:
>
>
>
> Abc_1234_kjhksh_276
>
> Abc
>
> Abc_1234_lakdofyo_324
>
> Bce_876_skdhk_*&^%*&
>
> Bce
>
> Bce_454
>
>
>
> And I would like to see the following results
>
> Abc
>
> Abc_1234
>
> Bce
>
>
>
>
>
> Jeff Reichman
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
