[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551885#comment-16551885 ]
Nick Nicolini edited comment on SPARK-16203 at 7/22/18 3:21 AM:
----------------------------------------------------------------

[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit many cases of regexp parsing where we need to match on something that is arbitrary in length; for example, a text block that looks something like:

{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}

Here I need to pull out all values between "MSG:" and "|", which can occur between 1 and n times in each instance. I cannot reliably use the method shown above, and while I can write a UDF to handle this, it'd be great if this were supported natively in Spark. Perhaps we can implement something like "regexp_extract_all", as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have?
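For illustration only, the UDF workaround mentioned in the comment might look roughly like the following plain-Python sketch. The function name `extract_all_msgs` and the exact pattern are assumptions made for this example; they are not part of Spark or of the original comment.

```python
import re

# Assumed pattern: capture everything between "MSG:" and the next "|".
MSG_PATTERN = re.compile(r"MSG:([^|]*)\|")

def extract_all_msgs(text):
    """Return every value found between "MSG:" and "|" in text, in order."""
    if text is None:
        return []
    # findall returns the captured group for each non-overlapping match.
    return MSG_PATTERN.findall(text)

block = "AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| MSG:ZXCV|"
print(extract_all_msgs(block))
```

In PySpark, a function like this could presumably be registered with `pyspark.sql.functions.udf` using an `ArrayType(StringType())` return type, at the usual cost of running Python code per row, which is why native support would be preferable.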
> regexp_extract to return an ArrayType(StringType())
> ---------------------------------------------------
>
>                 Key: SPARK-16203
>                 URL: https://issues.apache.org/jira/browse/SPARK-16203
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case,
> e.g., in web log parsing) we need to parse the entire line and get all the
> groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient. (How
> would Spark know not to do multiple pattern matching operations when only
> one is needed? Or does the optimizer actually check whether the patterns are
> identical and, if they are, avoid the repeated regex matching operations?)
> Would it be possible to have it return an array when the index is not
> specified (defaulting to None)?
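As a rough illustration of the behavior requested in the issue description, the plain-Python sketch below runs the regex once per line and returns every captured group as a list, instead of matching once per group. The log format, the pattern, and the name `extract_groups` are hypothetical and chosen only for this example; this is not Spark's API.

```python
import re

# Hypothetical simplified access-log layout: host [timestamp] "request"
LOG_PATTERN = re.compile(r'(\S+) \[([^\]]+)\] "([^"]*)"')

def extract_groups(line):
    """Match the line once and return all captured groups as a list of strings."""
    match = LOG_PATTERN.search(line)
    # An empty list signals "no match"; a real implementation might prefer null.
    return list(match.groups()) if match else []

line = '10.0.0.1 [22/Jul/2018:03:21:00 +0000] "GET /index.html HTTP/1.1"'
print(extract_groups(line))
```

The design point is that a single pattern-matching pass yields all groups, which is the efficiency concern the reporter raises about calling regexp_extract once per group.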