[jira] [Commented] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups

Pablo Langa Blanco (Jira) Sat, 06 Aug 2022 14:03:07 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576281#comment-17576281
 ]


Pablo Langa Blanco commented on SPARK-39796:
--------------------------------------------

Hi [~augustine_theodore] ,

Is this what you are looking for?
{code:java}
scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)"
regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+)

scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.withColumn("g1", regexp_extract('a, regex, 1)).withColumn("g2", regexp_extract('a, regex, 2)).show
+------------------+-----+----+
|                 a|   g1|  g2|
+------------------+-----+----+
|Hello, World, 1234|Hello|1234|
| Good, bye, friend|     |    |
+------------------+-----+----+{code}
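
For comparison, the array-shaped result asked for in the issue can be sketched in plain Python with the standard `re` module. This only illustrates the requested semantics and is not an existing Spark function. The helper name `extract_groups` is hypothetical, and returning an empty list when nothing matches is an assumption taken from the expected output in the issue description.

```python
import re
from typing import List

# Pattern from the issue: only the first and third words are captured.
PATTERN = re.compile(r"([A-Za-z]+), [A-Za-z]+, (\d+)")


def extract_groups(value: str, pattern: re.Pattern = PATTERN) -> List[str]:
    """Hypothetical helper: return every capture group of the first match.

    Returns an empty list when the pattern does not match at all (an
    assumption based on the expected output in the issue).
    """
    # search() finds the first match anywhere in the string.
    match = pattern.search(value)
    if match is None:
        return []
    # A group that did not participate in the match becomes "" (a design
    # choice made here so the list never contains None).
    return list(match.groups(default=""))


if __name__ == "__main__":
    for row in ["Hello, World, 1234", "Good, bye, friend"]:
        print(row, "->", extract_groups(row))
```

A function like this could presumably be wrapped in a PySpark UDF with an `ArrayType(StringType())` return type, at the usual cost of a Python UDF compared with a built-in expression.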

> Add a regexp_extract variant which returns an array of all the matched 
> capture groups
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-39796
>                 URL: https://issues.apache.org/jira/browse/SPARK-39796
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: Augustine Theodore Prince
>            Priority: Minor
>              Labels: regexp_extract, regexp_extract_all, regexp_replace
>
>  
> regexp_extract only returns a single matched group. In a lot of cases we need 
> to parse the entire string and get all the groups and for that we'll need to 
> call it as many times as there are groups. The regexp_extract_all function 
> doesn't solve this problem as it only works if all the groups have the same 
> regex pattern.
>  
> _Example:_
> I will provide an example and the workaround that I currently use. Suppose 
> I have the following dataframe and would like to match column 'a' against 
> this pattern:
> {code:java}
> "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
> |a|
> |Hello, World, 1234|
> |Good, bye, friend|
>  
> My expected output is as follows:
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|[]|
>  
> However, to achieve this I have to take the following approach, which seems 
> very hackish.
> 1. Use regexp_replace to create a temporary string built using the extracted 
> groups:
> {code:java}
> df = df.withColumn("extracted_a", F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, (\\d+)", "$1_$2")){code}
> A side effect of regexp_replace is that, if the regex fails to match, the 
> entire input string is returned unchanged.
>  
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|Good, bye, friend|
> 2. So, to achieve the desired result, a check has to be done to null out the 
> values in rows that did not match the pattern:
> {code:java}
> df = df.withColumn("extracted_a", F.when(F.col("extracted_a") == F.col("a"), None).otherwise(F.col("extracted_a"))){code}
>  
> to get the following intermediate dataframe,
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|null|
>  
> 3. Finally, split the column 'extracted_a' on underscores:
> {code:java}
> df = df.withColumn("extracted_a", F.split("extracted_a", "[_]")){code}
> which produces the desired result:
>  
>  
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|null|
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
