[ https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576281#comment-17576281 ]
Pablo Langa Blanco commented on SPARK-39796: -------------------------------------------- Hi [~augustine_theodore] , Is this what are you loking for? {code:java} scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)" regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+) scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a") df: org.apache.spark.sql.DataFrame = [a: string] scala> df.withColumn("g1", regexp_extract('a, "([A-Za-z]+), [A-Za-z]+, (\\d+)", 1)).withColumn("g2", regexp_extract('a, regex, 2)).show +------------------+-----+----+ | a| g1| g2| +------------------+-----+----+ |Hello, World, 1234|Hello|1234| | Good, bye, friend| | | +------------------+-----+----+{code} > Add a regexp_extract variant which returns an array of all the matched > capture groups > ------------------------------------------------------------------------------------- > > Key: SPARK-39796 > URL: https://issues.apache.org/jira/browse/SPARK-39796 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.2 > Reporter: Augustine Theodore Prince > Priority: Minor > Labels: regexp_extract, regexp_extract_all, regexp_replace > > > regexp_extract only returns a single matched group. In a lot of cases we need > to parse the entire string and get all the groups and for that we'll need to > call it as many times as there are groups. The regexp_extract_all function > doesn't solve this problem as it only works if all the groups have the same > regex pattern. > > _Example:_ > I will provide an example and the current workaround that I use to solve this, > If I have the following dataframe and I would like to match the column 'a' > with this pattern > {code:java} > "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code} > |a| > |Hello, World, 1234| > |Good, bye, friend| > > My expected output is as follows: > |a|extracted_a| > |Hello, World, 1234|[Hello, 1234]| > |Good, bye, friend|[]| > > However, to achieve this I have to take the following approach which seems > very hackish. > 1. Use regexp_replace to create a temporary string built using the extracted > groups: > {code:java} > df.withColumn("extr" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, > (\\d+)", "$1_$2")){code} > A side effect of regexp_replace is that if the regex fails to match the > entire string is returned. > > |a|extracted_a| > |Hello, World, 1234|Hello_1234| > |Good, bye, friend|Good, bye, friend| > 2. So, to achieve the desired result, a check has to be done to prune the > rows that did not match with the pattern : > {code:java} > df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , > None).otherwise(F.col("extracted_a"))){code} > > to get the following intermediate dataframe, > |a|extracted_a| > |Hello, World, 1234|Hello_1234| > |Good, bye, friend|null| > > 3. Before finally splitting the column 'extracted_a' based on underscores > {code:java} > df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code} > which results in the desired result: > > > |a|extracted_a > | > |Hello, World, 1234|[Hello, 1234]| > |Good, bye, friend|null| > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org