GitHub user ueshin commented on the pull request:
https://github.com/apache/spark/pull/1586#issuecomment-50637172
Hi @javadba, FYI.
I believe there are three kinds of "length" for a string in Java/Scala.
1) the number of 16-bit characters (UTF-16 code units) in the string
To get this, use
[`String#length`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length())
like:
```scala
scala> "\uF93D\uF936\uF949\uF942".length // chinese characters
res0: Int = 4
scala> "\uD840\uDC0B\uD842\uDFB7".length // 2 surrogate pairs
res1: Int = 4
scala> "1234567890ABC".length
res2: Int = 13
```
2) the number of code points in the string
This will be covered by `Length`.
To get this, use
[`String#codePointCount`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointCount(int,%20int))
like:
```scala
scala> "\uF93D\uF936\uF949\uF942".codePointCount(0, 4) // chinese characters
res0: Int = 4
scala> "\uD840\uDC0B\uD842\uDFB7".codePointCount(0, 4) // 2 surrogate pairs
res1: Int = 2
scala> "1234567890ABC".codePointCount(0, 13)
res2: Int = 13
```
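The end index passed to `codePointCount` above is simply each string's `length`, so the call generalizes to any string. A minimal sketch, where `codePointLength` is a hypothetical helper name used only for illustration (it is not part of Spark or this pull request):

```scala
// Hypothetical helper: counts Unicode code points over the whole string.
def codePointLength(s: String): Int = s.codePointCount(0, s.length)

codePointLength("\uD840\uDC0B\uD842\uDFB7") // 2: each surrogate pair is one code point
codePointLength("1234567890ABC")            // 13: equals length when no surrogate pairs are present
```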
3) the length of the byte array produced by encoding the string in a given charset
To get this, use
[`String#getBytes(charset)`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#getBytes(java.lang.String))`.length`
like:
```scala
scala> "\uF93D\uF936\uF949\uF942".getBytes("utf8").length // chinese characters
res0: Int = 12
scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf8").length // 2 surrogate pairs
res1: Int = 8
scala> "1234567890ABC".getBytes("utf8").length
res2: Int = 13
scala> "\uF93D\uF936\uF949\uF942".getBytes("utf16").length // chinese characters
res3: Int = 10
scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf16").length // 2 surrogate pairs
res4: Int = 10
scala> "1234567890ABC".getBytes("utf16").length
res5: Int = 28
scala> "\uF93D\uF936\uF949\uF942".getBytes("utf32").length // chinese characters
res6: Int = 16
scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf32").length // 2 surrogate pairs
res7: Int = 8
scala> "1234567890ABC".getBytes("utf32").length
res8: Int = 52
```
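The UTF-16 results above are 2 bytes larger than twice the `length` because Java's `UTF-16` charset writes a byte-order mark when encoding, while `UTF-16BE` and `UTF-16LE` do not. A small sketch that makes the difference visible, using `java.nio.charset.StandardCharsets` (available since Java 7) instead of charset name strings:

```scala
import java.nio.charset.StandardCharsets

val s = "1234567890ABC"

// "UTF-16" prepends a 2-byte byte-order mark (0xFE 0xFF) when encoding.
val withBom = s.getBytes(StandardCharsets.UTF_16)
// "UTF-16BE" fixes the byte order up front and writes no byte-order mark.
val withoutBom = s.getBytes(StandardCharsets.UTF_16BE)

withBom.length    // 28 = 2 (byte-order mark) + 13 * 2
withoutBom.length // 26 = 13 * 2
```

If the byte length is meant to reflect only the encoded characters, an explicit-endianness charset may be the safer choice.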
At first I guessed you wanted 3) for `Strlen`, since 3) is the only
charset-dependent length, but the conversation I was following pointed to
another type of "length", and I lost track of it halfway through.