Github user xuejianbest commented on a diff in the pull request:
https://github.com/apache/spark/pull/22048#discussion_r214778257
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
}
}
}
+
+ /**
+ * Regular expression matching full width characters
+ */
+ private val fullWidthRegex = ("""[""" +
+ // scalastyle:off nonascii
+ """\u1100-\u115F""" +
+ """\u2E80-\uA4CF""" +
+ """\uAC00-\uD7A3""" +
+ """\uF900-\uFAFF""" +
+ """\uFE10-\uFE19""" +
+ """\uFE30-\uFE6F""" +
+ """\uFF00-\uFF60""" +
+ """\uFFE0-\uFFE6""" +
--- End diff --
> Can you describe them there and put a references to a public unicode
> document?
The regular expression matches Unicode characters in the decoded string, so
it works regardless of the encoding the bytes originally used.
For example, the following string is encoded using GBK instead of UTF-8, and
the match still works:
```
val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
val s1 = new String(bytes, "gbk")
println(s1) // 中国
val fullWidthRegex = ("""[""" +
  // scalastyle:off nonascii
  """\u1100-\u115F""" +
  """\u2E80-\uA4CF""" +
  """\uAC00-\uD7A3""" +
  """\uF900-\uFAFF""" +
  """\uFE10-\uFE19""" +
  """\uFE30-\uFE6F""" +
  """\uFF00-\uFF60""" +
  """\uFFE0-\uFFE6""" +
  // scalastyle:on nonascii
  """]""").r
println(fullWidthRegex.findAllIn(s1).size) // 2
```
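To make the encoding point concrete, here is a minimal sketch that decodes the same text from GBK bytes and from UTF-8 bytes and checks that the match count is identical. `FullWidthDemo` and `displayWidth` are hypothetical names used only for illustration, and counting each full-width character as two columns is an assumption of this sketch, not a statement about how the patch is implemented.

```scala
import java.nio.charset.Charset

object FullWidthDemo {
  // Same character ranges as in the diff above.
  private val fullWidthRegex = ("""[""" +
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    """]""").r

  // Number of characters in `s` that fall inside the full-width ranges.
  def fullWidthCount(s: String): Int = fullWidthRegex.findAllIn(s).size

  // Hypothetical helper: assumes a full-width character occupies two columns.
  def displayWidth(s: String): Int = s.length + fullWidthCount(s)

  def main(args: Array[String]): Unit = {
    val text = "ab\u4E2D\u56FD" // "ab" followed by the two characters from the example
    val gbk = Charset.forName("GBK")
    val utf8 = Charset.forName("UTF-8")

    // Round-trip through two different byte encodings.
    val fromGbk = new String(text.getBytes(gbk), gbk)
    val fromUtf8 = new String(text.getBytes(utf8), utf8)

    // Once decoded, both are the same String, so the regex sees the same characters.
    println(fromGbk == fromUtf8)      // true
    println(fullWidthCount(fromGbk))  // 2
    println(fullWidthCount(fromUtf8)) // 2
    println(displayWidth(fromGbk))    // 6
  }
}
```

The byte encoding only matters while decoding; after that the regex operates on the characters of the `String`.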
This regular expression was obtained experimentally under a specific font.
I don't understand what you are going to do.
> How about some additional overheads when calling showString as compared
> to showString w/o this patch?
I tested a Dataset of 100 rows with two columns: one column is the index
(0-99) and the other is a random string of 100 characters. I then called
showString on it with and without this patch.
The original showString method (w/o this patch) took about 42 ms and the
patched version took about 46 ms, so the patch is roughly 10% slower.
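As a rough way to isolate the extra work, the sketch below times a plain `length` pass against a `length` plus regex pass over 100 random strings of 100 characters. It does not call showString or use Spark at all, so it will not reproduce the numbers above; `WidthOverheadSketch` and `avgMillis` are hypothetical names, and the seed and iteration counts are arbitrary assumptions.

```scala
import scala.util.Random

object WidthOverheadSketch {
  // Same character ranges as in the diff above.
  private val fullWidthRegex = ("""[""" +
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    """]""").r

  // Hypothetical helper: runs `body` `iters` times, returns average ms per run.
  def avgMillis(iters: Int)(body: => Unit): Double = {
    val start = System.nanoTime()
    var i = 0
    while (i < iters) { body; i += 1 }
    (System.nanoTime() - start) / 1e6 / iters
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42) // arbitrary fixed seed
    val rows = Seq.fill(100)(rng.alphanumeric.take(100).mkString)
    var sink = 0L // accumulates results so the work is not optimized away

    // Warm up the JIT before measuring.
    avgMillis(1000) { rows.foreach(s => sink += s.length + fullWidthRegex.findAllIn(s).size) }

    val plain = avgMillis(1000) { rows.foreach(s => sink += s.length) }
    val withRegex = avgMillis(1000) {
      rows.foreach(s => sink += s.length + fullWidthRegex.findAllIn(s).size)
    }
    println(f"plain length: $plain%.4f ms, with regex pass: $withRegex%.4f ms (sink=$sink)")
  }
}
```

A benchmark harness such as JMH would give more trustworthy numbers than this hand-rolled loop.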