GitHub user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18477#discussion_r125298719
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
    @@ -268,7 +268,7 @@ case class StringSplit(str: Expression, pattern: 
Expression)
       usage = "_FUNC_(str, regexp, rep) - Replaces all substrings of `str` 
that match `regexp` with `rep`.",
       extended = """
         Examples:
    -      > SELECT _FUNC_('100-200', '(\d+)', 'num');
    +      > SELECT _FUNC_('100-200', '(\\d+)', 'num');
    --- End diff --
    
    Hmm, when I wrote the docs on line 160, I was advised to use unescaped 
characters.
    
    > Since Spark 2.0, string literals (including regex patterns) are unescaped 
in our SQL parser. For example, to match "\abc", a regular expression for 
`regexp` can be "^\\abc$".
    
    Actually, you need to write it like this in spark-shell:
    
        scala> sql("SELECT like('\\\\abc', '\\\\\\\\abc')").show
        +---------------+
        |\abc LIKE \\abc|
        +---------------+
        |           true|
        +---------------+
    
        scala> sql("SELECT regexp_replace('100-200', '(\\\\d+)', 'num')").show
        +-----------------------------------+
        |regexp_replace(100-200, (\d+), num)|
        +-----------------------------------+
        |                            num-num|
        +-----------------------------------+
    
    
    In spark-shell, the Scala string literal `\\\\abc` reaches the SQL parser 
as `\\abc`, which Spark 2 then unescapes to `\abc`. Likewise, `(\\\\d+)` ends up 
as `(\d+)`.
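
    To make the two rounds of unescaping in spark-shell concrete, here is a 
minimal sketch that runs without Spark. `sqlUnescape` is a hypothetical 
stand-in that only collapses `\\` to `\`; it is not Spark's actual parser 
code, and the object name is likewise made up for illustration.

```scala
object EscapeLayers {
  // Hypothetical stand-in for the SQL parser's string unescaping.
  // It only collapses an escaped backslash into a single backslash;
  // Spark's real parser handles more escape sequences than this.
  def sqlUnescape(literal: String): String =
    literal.replace("\\\\", "\\")

  def main(args: Array[String]): Unit = {
    // Round 1: the Scala compiler turns the four backslashes typed in
    // spark-shell into two backslash characters.
    val afterScala = "(\\\\d+)"

    // Round 2: the (simulated) SQL parser turns those two into one.
    val afterSql = sqlUnescape(afterScala)

    println(afterScala) // the text the SQL parser receives
    println(afterSql)   // the pattern the regex engine receives

    // The regex engine finally sees a digit-matching pattern.
    println("100-200".replaceAll(afterSql, "num"))
  }
}
```

    In spark-sql only the second round applies, which would explain why two 
backslashes are enough there.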
    
    But in spark-sql, you write the queries like this:
    
        spark-sql> SELECT like('\\abc', '\\\\abc');
        true
        Time taken: 0.061 seconds, Fetched 1 row(s)
    
        spark-sql> SELECT regexp_replace('100-200', '(\\d+)', 'num');
        num-num
        Time taken: 0.117 seconds, Fetched 1 row(s)
    
    So depending on how the shell environment processes string escaping, the 
query looks different. It seems to me that writing the docs in the unescaped 
style would avoid this confusion?
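
    As a side note, one way to keep the spark-shell query text identical to 
the spark-sql one might be a triple-quoted Scala string, which does not 
process backslash escapes. A small sketch, again without Spark (the object 
and value names are made up for illustration):

```scala
object RawQueryText {
  // Regular literal: the Scala compiler consumes one level of backslashes,
  // so four must be typed to hand two to the SQL parser.
  val escaped: String = "SELECT regexp_replace('100-200', '(\\\\d+)', 'num')"

  // Triple-quoted literal: backslashes pass through untouched, so the text
  // is the same as what one would type at the spark-sql prompt.
  val raw: String = """SELECT regexp_replace('100-200', '(\\d+)', 'num')"""

  def main(args: Array[String]): Unit = {
    println(raw)
    println(escaped == raw) // both literals denote the same query text
  }
}
```

    In spark-shell, either string could then be passed to `sql(...)` as in 
the examples above.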
    


