spark git commit: [SPARK-25108][SQL] Fix the show method to display the wide character alignment problem

2018-09-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 3682d29f4 -> a7cfe5158


[SPARK-25108][SQL] Fix the show method to display the wide character alignment 
problem

This is not a perfect solution. It is designed to fix the problem while 
adding as little complexity as possible.

It works for English, Chinese, Japanese, Korean, and similar text.

```scala
before:
+---+---+-+
|id |中国 |s2   |
+---+---+-+
|1  |ab |[a]  |
|2  |null   |[中国, abc]|
|3  |ab1|[hello world]|
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]|
|5  |中国(你好)a|[“中(国), 312] |
|6  |中国山(东)服务区  |[“中(国)]  |
|7  |中国山东服务区|[中(国)]   |
|8  |   |[中国] |
+---+---+-+

after:
+---+---++
|id |中国   |s2  |
+---+---++
|1  |ab |[a] |
|2  |null   |[中国, abc] |
|3  |ab1|[hello world]   |
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国] |
|5  |中国(你好)a  |[“中(国), 312]|
|6  |中国山(东)服务区   |[“中(国)]  |
|7  |中国山东服务区 |[中(国)]|
|8  |   |[中国]  |
+---+---++
```

## What changes were proposed in this pull request?

When the data contains wide characters, such as Chinese or Japanese 
characters, the output of the show method is misaligned.
This patch fixes the alignment by counting each full width character as two 
half widths when padding cells.
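
The misalignment can be reproduced outside Spark. The following is a minimal sketch, not the code from this patch: `PadSketch`, `isWide`, `displayWidth`, `padByLength` and `padByDisplayWidth` are hypothetical names, and `isWide` covers only two of the ranges for brevity. It contrasts padding by character count with padding by display width.

```scala
object PadSketch {
  // Hypothetical simplification: treat CJK/kana (U+2E80-U+A4CF) and
  // Hangul syllables (U+AC00-U+D7A3) as two columns wide, everything else as one.
  def isWide(c: Char): Boolean =
    (c >= '\u2E80' && c <= '\uA4CF') || (c >= '\uAC00' && c <= '\uD7A3')

  // Number of terminal columns the string is assumed to occupy.
  def displayWidth(s: String): Int =
    s.map(c => if (isWide(c)) 2 else 1).sum

  // Pads by character count; cells with wide characters end up too wide on screen.
  def padByLength(s: String, width: Int): String =
    s + " " * math.max(0, width - s.length)

  // Pads by display width; every cell occupies the same number of columns.
  def padByDisplayWidth(s: String, width: Int): String =
    s + " " * math.max(0, width - displayWidth(s))

  def main(args: Array[String]): Unit = {
    val cells = Seq("ab", "中国", "null")
    println("padded by length:")
    cells.foreach(c => println("|" + padByLength(c, 6) + "|"))
    println("padded by display width:")
    cells.foreach(c => println("|" + padByDisplayWidth(c, 6) + "|"))
  }
}
```

In a terminal that renders CJK characters at double width, the closing bars should line up only in the second group.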

## How was this patch tested?


![image](https://user-images.githubusercontent.com/13044869/44250564-69f6b400-a227-11e8-88b2-6cf6960377ff.png)


Closes #22048 from xuejianbest/master.

Authored-by: xuejianbest <384329...@qq.com>
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a7cfe515
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a7cfe515
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a7cfe515

Branch: refs/heads/branch-2.4
Commit: a7cfe5158f5c25ae5f774e1fb45d63a67a4bb89c
Parents: 3682d29
Author: xuejianbest <384329...@qq.com>
Authored: Thu Sep 6 07:17:37 2018 -0700
Committer: Sean Owen 
Committed: Thu Sep 6 10:48:22 2018 -0700

--
 .../scala/org/apache/spark/util/Utils.scala | 30 
 .../org/apache/spark/util/UtilsSuite.scala  | 21 +
 .../scala/org/apache/spark/sql/Dataset.scala| 18 +++
 .../org/apache/spark/sql/DatasetSuite.scala | 49 
 4 files changed, 109 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a7cfe515/core/src/main/scala/org/apache/spark/util/Utils.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala
index 15c958d..4593b05 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -2795,6 +2795,36 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters.
+   *
+   * Looked at all the 0x0000-0xFFFF characters (unicode) and showed them under Xshell.
+   * Found all the full width characters, then get the regular expression.
+   */
+  private val fullWidthRegex = ("""[""" +
+    // scalastyle:off nonascii
+    """\u1100-\u115F""" +
+    """\u2E80-\uA4CF""" +
+    """\uAC00-\uD7A3""" +
+    """\uF900-\uFAFF""" +
+    """\uFE10-\uFE19""" +
+    """\uFE30-\uFE6F""" +
+    """\uFF00-\uFF60""" +
+    """\uFFE0-\uFFE6""" +
+    // scalastyle:on nonascii
+    """]""").r
+
+  /**
+   * Return the number of half widths in a given string. Note that a full width character
+   * occupies two half widths.
+   *
+   * For a string consisting of 1 million characters, the execution of this method requires
+   * about 50ms.
+   */
+  def stringHalfWidth(str: String): Int = {
+    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
+  }
 }
 
 private[util] object CallerContext extends Logging
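
For readers who want to try the counting logic outside Spark, here is a minimal standalone sketch. It reuses the character ranges and the `stringHalfWidth` body from the diff above, but the wrapper object `HalfWidthSketch` and the sample strings are assumptions made for illustration.

```scala
object HalfWidthSketch {
  // Same ranges as in the diff above, written as a single character class.
  private val fullWidthRegex = ("""[""" +
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    """]""").r

  // Every character counts once; each full width match counts one extra time,
  // so a full width character contributes two half widths in total.
  def stringHalfWidth(str: String): Int =
    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size

  def main(args: Array[String]): Unit = {
    // Sample inputs chosen for illustration only.
    Seq("abc", "中国", "ab中国", "한글").foreach { s =>
      println(s + " -> " + stringHalfWidth(s))
    }
  }
}
```

The method treats each `Char` independently, so characters outside the Basic Multilingual Plane, which are stored as surrogate pairs, are counted as two ordinary half widths regardless of how a terminal renders them.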

spark git commit: [SPARK-25108][SQL] Fix the show method to display the wide character alignment problem

2018-09-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 64c314e22 -> f5817d8bb


[SPARK-25108][SQL] Fix the show method to display the wide character alignment 
problem

Closes #22048 from xuejianbest/master.

Authored-by: xuejianbest <384329...@qq.com>
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5817d8b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5817d8b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5817d8b

Branch: refs/heads/master
Commit: f5817d8bb33b733eeca0154d1ed207c8d1e8513f
Parents: 64c314e
Author: xuejianbest <384329...@qq.com>
Authored: Thu Sep 6 07:17:37 2018 -0700
Committer: Sean Owen 
Committed: Thu Sep 6 07:17:37 2018 -0700

--
 .../scala/org/apache/spark/util/Utils.scala | 30 
 .../org/apache/spark/util/UtilsSuite.scala  | 21 +
 .../scala/org/apache/spark/sql/Dataset.scala| 18 +++
 .../org/apache/spark/sql/DatasetSuite.scala | 49 
 4 files changed, 109 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f5817d8b/core/src/main/scala/org/apache/spark/util/Utils.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala
index 15c958d..4593b05 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -2795,6 +2795,36 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters.
+   *
+   * Looked at all the 0x0000-0xFFFF characters (unicode) and showed them under Xshell.
+   * Found all the full width characters, then get the regular expression.
+   */
+  private val fullWidthRegex = ("""[""" +
+    // scalastyle:off nonascii
+    """\u1100-\u115F""" +
+    """\u2E80-\uA4CF""" +
+    """\uAC00-\uD7A3""" +
+    """\uF900-\uFAFF""" +
+    """\uFE10-\uFE19""" +
+    """\uFE30-\uFE6F""" +
+    """\uFF00-\uFF60""" +
+    """\uFFE0-\uFFE6""" +
+    // scalastyle:on nonascii
+    """]""").r
+
+  /**
+   * Return the number of half widths in a given string. Note that a full width character
+   * occupies two half widths.
+   *
+   * For a string consisting of 1 million characters, the execution of this method requires
+   * about 50ms.
+   */
+  def stringHalfWidth(str: String): Int = {
+    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
+  }
 }
 
 private[util] object CallerContext extends Logging