GitHub user CHOIJAEHONG1 commented on a diff in the pull request:
https://github.com/apache/spark/pull/7494#discussion_r35721159
--- Diff: R/pkg/R/deserialize.R ---
@@ -56,8 +56,10 @@ readTypedObject <- function(con, type) {
readString <- function(con) {
stringLen <- readInt(con)
- string <- readBin(con, raw(), stringLen, endian = "big")
- rawToChar(string)
+ raw <- readBin(con, raw(), stringLen, endian = "big")
+ string <- rawToChar(raw)
+ Encoding(string) <- "UTF-8"
+ enc2native(string)
--- End diff --
Yes, preserving UTF-8 encodings sounds much better.
Do you mean that `enc2native` should be removed, as below?
```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  string
}
```
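For reference, here is a minimal sketch (not part of the patch) of what `Encoding(string) <- "UTF-8"` does on its own. It assumes a sample string built from raw bytes, so the names `bytes` and `marked` are only illustrative. The point is that marking the encoding changes the declared encoding of the string, not its bytes.

```r
# UTF-8 bytes for a two-character Korean sample (illustrative input only),
# built from raw values so the example does not depend on the file encoding
bytes <- as.raw(c(0xec, 0x95, 0x88, 0xeb, 0x85, 0x95))
string <- rawToChar(bytes)

# Copy the string and mark the copy as UTF-8, as readString() would
marked <- string
Encoding(marked) <- "UTF-8"

# Declared encodings before and after marking
print(Encoding(string))
print(Encoding(marked))

# The underlying bytes are the same in both cases
print(identical(charToRaw(string), charToRaw(marked)))
```

So any difference seen later in the serialized output should come from the encoding mark itself, not from the string contents.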
However, this causes an error when calling `createDataFrame()`, which converts a
local R data frame to a Spark DataFrame.
I tried to find the reason, and my guess is that a high bit gets set in the
serialized string when I use `Encoding(string) <- "UTF-8"`, and is not set otherwise.
I printed `serializedSlices` in `parallelize()` at line 132. The result is
shown below.
context.R
```
102 parallelize <- function(sc, coll, numSlices = 1) {
103   # TODO: bound/safeguard numSlices
104   # TODO: unit tests for if the split works for all primitives
105   # TODO: support matrix, data frame, etc
106   if ((!is.list(coll) && !is.vector(coll)) || is.data.frame(coll)) {
107     if (is.data.frame(coll)) {
108       message(paste("context.R: A data frame is parallelized by columns."))
109     } else {
110       if (is.matrix(coll)) {
111         message(paste("context.R: A matrix is parallelized by elements."))
112       } else {
113         message(paste("context.R: parallelize() currently only supports lists and vectors.",
114                       "Calling as.list() to coerce coll into a list."))
115       }
116     }
117     coll <- as.list(coll)
118   }
119
120   print(coll)
121
122   if (numSlices > length(coll))
123     numSlices <- length(coll)
124
125   sliceLen <- ceiling(length(coll) / numSlices)
126   slices <- split(coll, rep(1:(numSlices + 1), each = sliceLen)[1:length(coll)])
127
128   # Serialize each slice: obtain a list of raws, or a list of lists (slices) of
129   # 2-tuples of raws
130   serializedSlices <- lapply(slices, serialize, connection = NULL)
131
132   print(serializedSlices)
133   jrdd <- callJStatic("org.apache.spark.api.r.RRDD",
134                       "createRDDFromArray", sc, serializedSlices)
135
136   RDD(jrdd, "byte")
137 }
```
Case 1. with Encoding(string) <- "UTF-8"
[1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00
00 00
[26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00
00 10
[51] 00 00 00 01 00 00 80 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84
b8 ec
[76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00
00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 06 e6 82 a8 e5 a5 bd
00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00
00 00
[151] 10 00 00 00 01 00 00 80 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3
81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00
00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 09 58 69 6e 20 63
68 c3
[226] a0 6f
Case 2. without Encoding(string) <- "UTF-8"
[1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00
00 00
[26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00
00 10
[51] 00 00 00 01 00 00 00 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84
b8 ec
[76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00
00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 e6 82 a8 e5 a5 bd
00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00
00 00
[151] 10 00 00 00 01 00 00 00 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3
81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00
00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 09 58 69 6e 20 63
68 c3
[226] a0 6f
You can see that the rows starting at [51], [101], [151], and [201] differ.
With `Encoding()`, there is an 80 byte immediately before the 09, which, I
guess, is what causes the error.
I think this is the encoding indication bit in R, according to the link
you gave.
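A small sketch that should reproduce the comparison outside of SparkR, assuming only base R. It serializes the same bytes once unmarked and once marked as UTF-8, then lists the positions where the two raw vectors differ. The variable names are illustrative, and the exact offsets depend on the R version's serialization header, so none are hard-coded.

```r
# UTF-8 bytes e6 82 a8 e5 a5 bd, copied from the second string in the dumps
bytes <- as.raw(c(0xe6, 0x82, 0xa8, 0xe5, 0xa5, 0xbd))

unmarked <- rawToChar(bytes)
marked <- unmarked
Encoding(marked) <- "UTF-8"

# Serialize both variants the same way parallelize() does
s_unmarked <- serialize(unmarked, connection = NULL)
s_marked <- serialize(marked, connection = NULL)

# Positions where the two serialized forms differ, and the bytes found there
diff_idx <- which(s_unmarked != s_marked)
print(diff_idx)
print(s_unmarked[diff_idx])
print(s_marked[diff_idx])
```

If the guess above is right, the only difference should be in the flag bytes written just before the string length, not in the string bytes themselves.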