Re: [R] Regex Split?

Leonard Mada via R-help Fri, 05 May 2023 14:54:42 -0700

Dear Avi,

Punctuation marks are used in various NLP language models. Preserving the "," is therefore useful in such scenarios and Regex are useful to accomplish this (especially if you have sufficient experience with such expressions).

I only observed an odd behaviour with strsplit. The example string is constructed, but it is always wise to test a Regex expression against various scenarios. It is usually hard to predict what special cases will occur in a specific corpus.


strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
# "a"  "bc"  ","  "def"  ","  ""  "adef"  ","  ","  "gh"

stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?=,)|(?<=,)(?![ ])")
# "a"    "bc"   ","    "def"  ","    "adef"  ""     ","    "," "gh"

stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?<! )(?=,)|(?<=,)(?![ ])")

# "a"    "bc"   ","    "def"  ","    "adef"  ","    ","    "gh"

# Expected:
# "a"  "bc"   ","  "def"   ","  "adef"  ","   ","  "gh"
# see 2nd instance of stringi::stri_split
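
A possible alternative, sketched here only for comparison: extract the tokens instead of splitting, which sidesteps zero-length split matches altogether. It assumes a token is either a single comma or a run of characters that are neither a space nor a comma; the variable names are illustrative.

```r
x <- "a bc,def, adef ,,gh"

# Match the tokens directly: either one comma, or a run of
# characters that are neither a space nor a comma.
tokens <- regmatches(x, gregexpr(",|[^ ,]+", x))[[1]]
print(tokens)
# "a" "bc" "," "def" "," "adef" "," "," "gh"
```

This uses base R only, so the result does not depend on how a particular split function treats empty matches.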


Sincerely,


Leonard


On 5/5/2023 11:20 PM, avi.e.gr...@gmail.com wrote:

Leonard,

It can be helpful to spell out your intent in English or some of us have to go 
back to the documentation to remember what some of the operators do.

Your text being searched seems to be an example of items between commas with an 
optional space after some commas and, in one case, nothing between commas.

So what is your goal for the example, and in general? You mention a bit 
unclearly at the end some of what you expect and I think it would be clearer if 
you also showed exactly the output you would want.

I saw some other replies that addressed what you wanted and am going to reply 
in another direction.

Why do things the hard way using things like lookahead or lookbehind? Would 
several steps not get you the result far more clearly?

For the sake of argument, you either want what reading in a CSV file would 
supply, or something else. Since you are not simply splitting on commas, it 
sounds like something else. But what exactly else? Something as simple as this 
on just a comma produces results including empty strings and embedded leading 
or trailing spaces:

strsplit("a bc,def, adef ,,gh", ",")
[[1]]
[1] "a bc"   "def"    " adef " ""       "gh"

That can of course be handled by, for example, trimming the result after 
unlisting the odd way strsplit returns results:

library("stringr")
str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))

[1] "a bc" "def"  "adef" ""     "gh"

Now do you want the empty string to be something else, such as an NA? That can 
be done too with another step.
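
A minimal sketch of that extra step, assuming the empty string should become a missing value. It uses base R's trimws() in place of str_squish() so that it runs without stringr; the variable name is illustrative.

```r
x <- "a bc,def, adef ,,gh"

# Split on commas, then strip leading and trailing whitespace.
parts <- trimws(unlist(strsplit(x, ",")))

# Replace empty strings with a character NA.
parts[parts == ""] <- NA_character_
print(parts)
# "a bc" "def" "adef" NA "gh"
```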

And a completely different variant can be used to read in your one-line CSV as 
text using standard overkill tools:

read.table(text="a bc,def, adef ,,gh", sep=",")

     V1  V2     V3 V4 V5
1 a bc def  adef  NA gh

The above is a one-row data frame rather than a character vector. But if you 
simply want to reassemble your initial string cleaned up a bit, you can use 
paste to put back commas, as in a variation of the earlier example:

paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))), collapse=",")

[1] "a bc,def,adef,,gh"

So my question is whether using advanced methods is really necessary for your 
case, or even particularly efficient. If efficiency matters, it is often 
better to use tools that avoid regular expressions, such as paste0(), when 
they meet your needs.
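
As one illustration of avoiding the regex engine in the split itself, strsplit() accepts fixed = TRUE, which treats the separator as a literal string. This is only a sketch under the assumption that a plain comma split plus trimming is all that is needed; the variable names are illustrative.

```r
x <- "a bc,def, adef ,,gh"

# fixed = TRUE makes "," a literal separator, so the split
# itself does not go through the regular expression engine.
parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])

# Put the cleaned pieces back together with commas.
cleaned <- paste(parts, collapse = ",")
print(cleaned)
# "a bc,def,adef,,gh"
```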

Of course, unless I know what you are actually trying to do, my remarks may 
not be useful.



-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Leonard Mada via R-help
Sent: Thursday, May 4, 2023 5:00 PM
To: R-help Mailing List <r-help@r-project.org>
Subject: [R] Regex Split?

Dear R-Users,

I tried the following 3 Regex expressions in R 4.3:
strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
# "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"

strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
# "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"

strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
# "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"


Is this correct?


I feel that:
- none should return (after "def"): ",", "";
- the first one could also return "", "," (but probably not; not fully
sure about this);
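
For reference, a minimal sketch of a workaround that keeps the first expression and drops the empty string afterwards. It assumes an empty token is never wanted in the result; the variable names are illustrative.

```r
x <- "a bc,def, adef ,,gh"

# Split with the first expression from above.
res <- strsplit(x, " |(?=,)|(?<=,)(?![ ])", perl = TRUE)[[1]]

# Keep only the non-empty tokens.
tokens <- res[nzchar(res)]
print(tokens)
```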


Sincerely,


Leonard

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


