Github user xuejianbest commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22048#discussion_r214614936
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
           }
         }
       }
    +
    +  /**
    +   * Regular expression matching full width characters
    +   */
    +  private val fullWidthRegex = ("""[""" +
    +    // scalastyle:off nonascii
    +    """\u1100-\u115F""" +
    +    """\u2E80-\uA4CF""" +
    +    """\uAC00-\uD7A3""" +
    +    """\uF900-\uFAFF""" +
    +    """\uFE10-\uFE19""" +
    +    """\uFE30-\uFE6F""" +
    +    """\uFF00-\uFF60""" +
    +    """\uFFE0-\uFFE6""" +
    --- End diff --
    
    - How to get this Regex list? Any reference? It sounds like this should be 
a general problem
    
    I displayed every Unicode character in the range 0x0000-0xFFFF in
    Xshell and picked out the ones rendered at full width. The regular
    expression was built from the ranges those characters fall into.
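
For illustration, here is a minimal sketch of how a character class like this could be used to estimate display width, counting each full width character as two columns. The ranges are copied from the quoted diff. The closing `]`, the object name `FullWidthSketch`, and the helper `displayWidth` are assumptions made for this example, since the diff above is cut off before the expression ends.

```scala
object FullWidthSketch {
  // Ranges copied from the quoted diff. The closing "]" is an assumption:
  // the diff excerpt ends before the expression is completed.
  private val fullWidthRegex = ("""[""" +
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    """]""").r

  // Hypothetical helper: width of `s` in half width columns.
  // Every char counts once, and each full width match counts once more.
  def displayWidth(s: String): Int =
    if (s == null) 0 else s.length + fullWidthRegex.findAllIn(s).size
}
```

With this sketch, `FullWidthSketch.displayWidth("abc")` is 3, while a string of two CJK ideographs is 4.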
    
    
    - What is the performance impact?
    
    I generated 1000 strings, each consisting of 1000 characters drawn at
    random from the range 0x0000-0xFFFF (1 million characters in total).
    I then used this regular expression to find the full width characters
    in those strings.
    I ran the test for 100 rounds and averaged the results.
    On average, one round (matching all 1000 strings) takes 49 milliseconds.
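
A rough sketch of a measurement along these lines is shown below. It is not the code that was actually run: the object name `FullWidthBench`, the helper names, the fixed seed, and the closing `]` of the expression are all assumptions, and it is a simple wall-clock loop without JIT warm-up, so the numbers will vary by machine and JVM.

```scala
import scala.util.Random

object FullWidthBench {
  // Same ranges as the quoted diff; the closing "]" is an assumption.
  private val fullWidthRegex =
    ("""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF""" +
      """\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""").r

  // Build `count` strings of `length` chars, each char drawn uniformly
  // from 0x0000-0xFFFF (hypothetical helper, seeded for repeatability).
  def randomStrings(count: Int, length: Int, seed: Long): Seq[String] = {
    val rnd = new Random(seed)
    Seq.fill(count) {
      new String(Array.fill(length)(rnd.nextInt(0x10000).toChar))
    }
  }

  // Average wall-clock milliseconds per round of matching every input.
  def averageMillis(inputs: Seq[String], rounds: Int): Double = {
    var matches = 0L // accumulated so the matching work is not optimized away
    val start = System.nanoTime()
    var i = 0
    while (i < rounds) {
      inputs.foreach(s => matches += fullWidthRegex.findAllIn(s).size)
      i += 1
    }
    val elapsedNanos = System.nanoTime() - start
    if (matches < 0) println(matches) // never true; keeps `matches` in use
    elapsedNanos / 1e6 / rounds
  }

  def main(args: Array[String]): Unit = {
    val inputs = randomStrings(count = 1000, length = 1000, seed = 42L)
    println(f"average per round: ${averageMillis(inputs, rounds = 100)}%.1f ms")
  }
}
```

For a more trustworthy number, a harness such as JMH would be a better fit than a hand-written timing loop.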
    
    @gatorsmile 


---

