[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

CHOIJAEHONG1 Tue, 28 Jul 2015 18:37:40 -0700

Github user CHOIJAEHONG1 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7494#discussion_r35721159
  
    --- Diff: R/pkg/R/deserialize.R ---
    @@ -56,8 +56,10 @@ readTypedObject <- function(con, type) {
     
     readString <- function(con) {
       stringLen <- readInt(con)
    -  string <- readBin(con, raw(), stringLen, endian = "big")
    -  rawToChar(string)
    +  raw <- readBin(con, raw(), stringLen, endian = "big")
    +  string <- rawToChar(raw)
    +  Encoding(string) <- "UTF-8"
    +  enc2native(string)
    --- End diff --
    
    Yes, preserving the UTF-8 encoding sounds much better.
    Do you mean that `enc2native` should be removed, as below?
    ```
    readString <- function(con) {
      stringLen <- readInt(con)
      raw <- readBin(con, raw(), stringLen, endian = "big")
      string <- rawToChar(raw)
      Encoding(string) <- "UTF-8"
      string
    }
    ```
    But this causes an error when calling `createDataFrame()`, which converts a local R data frame to a Spark DataFrame.
    I tried to find the cause. My guess is that the MSB of a byte in the serialized string is set when I use `Encoding(string) <- "UTF-8"`, and is not set otherwise.
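    
    A minimal sketch of what `Encoding(string) <- "UTF-8"` does on its own, using the six bytes of the second string in the output shown later in this comment. The variable names are illustrative and not part of the PR. The point is that the call only changes the declared encoding mark; the underlying bytes stay the same.
    
    ```r
    # Six UTF-8 bytes taken from the serialized output in this comment.
    raw_bytes <- as.raw(c(0xe6, 0x82, 0xa8, 0xe5, 0xa5, 0xbd))

    # rawToChar() returns a string with no declared encoding.
    unmarked <- rawToChar(raw_bytes)

    # Declaring the encoding sets a mark on the string; no bytes are converted.
    marked <- unmarked
    Encoding(marked) <- "UTF-8"

    print(Encoding(unmarked))
    print(Encoding(marked))
    print(identical(charToRaw(marked), raw_bytes))
    ```
    
    By contrast, `enc2native()` may convert the bytes to the native encoding, which is why dropping it preserves UTF-8.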
    
    I printed out `serializedSlices` in `parallelize()`, line 132. The result is shown below.
    context.R
    ```
    102 parallelize <- function(sc, coll, numSlices = 1) {
    103   # TODO: bound/safeguard numSlices
    104   # TODO: unit tests for if the split works for all primitives
    105   # TODO: support matrix, data frame, etc
    106   if ((!is.list(coll) && !is.vector(coll)) || is.data.frame(coll)) {
    107     if (is.data.frame(coll)) {
    108       message(paste("context.R: A data frame is parallelized by columns."))
    109     } else {
    110       if (is.matrix(coll)) {
    111         message(paste("context.R: A matrix is parallelized by elements."))
    112       } else {
    113         message(paste("context.R: parallelize() currently only supports lists and vectors.",
    114                       "Calling as.list() to coerce coll into a list."))
    115       }
    116     }
    117     coll <- as.list(coll)
    118   }
    119
    120   print(coll)
    121
    122   if (numSlices > length(coll))
    123     numSlices <- length(coll)
    124
    125   sliceLen <- ceiling(length(coll) / numSlices)
    126   slices <- split(coll, rep(1:(numSlices + 1), each = sliceLen)[1:length(coll)])
    127
    128   # Serialize each slice: obtain a list of raws, or a list of lists (slices) of
    129   # 2-tuples of raws
    130   serializedSlices <- lapply(slices, serialize, connection = NULL)
    131
    132   print(serializedSlices)
    133   jrdd <- callJStatic("org.apache.spark.api.r.RRDD",
    134                       "createRDDFromArray", sc, serializedSlices)
    135
    136   RDD(jrdd, "byte")
    137 }
    ```
    Case 1. with Encoding(string) <- "UTF-8"
      [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
     [26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
     [51] 00 00 00 01 00 00 80 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
     [76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
    [101] 00 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
    [126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
    [151] 10 00 00 00 01 00 00 80 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
    [176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
    [201] 07 a2 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 09 58 69 6e 20 63 68 c3
    [226] a0 6f
    
    Case 2. without Encoding(string) <- "UTF-8"
      [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
     [26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
     [51] 00 00 00 01 00 00 00 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
     [76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
    [101] 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
    [126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
    [151] 10 00 00 00 01 00 00 00 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
    [176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
    [201] 07 a2 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 09 58 69 6e 20 63 68 c3
    [226] a0 6f
    
    You can see that the rows starting at [51], [101], [151], and [201] differ. With `Encoding()`, there is a leading 80 before the 09, which, I guess, causes the error.
    I think this is the encoding indication bit in R, according to the link you gave.
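    
    The difference can be reproduced without Spark. The sketch below serializes the same six bytes with and without the UTF-8 mark and decodes the 4-byte word around the differing byte. The bit layout used here (type in the low 8 bits, per-object flag bits shifted left by 12, with 8 as the UTF-8 flag) is my reading of R's serialization format and should be treated as an assumption; the variable names are illustrative.
    
    ```r
    # Same six UTF-8 bytes as the second string in the output above.
    s_plain <- rawToChar(as.raw(c(0xe6, 0x82, 0xa8, 0xe5, 0xa5, 0xbd)))
    s_utf8 <- s_plain
    Encoding(s_utf8) <- "UTF-8"

    # Serialize both strings to raw vectors (binary, big-endian by default).
    bytes_plain <- serialize(s_plain, connection = NULL)
    bytes_utf8 <- serialize(s_utf8, connection = NULL)

    # Positions where the two serializations disagree.
    diff_at <- which(bytes_plain != bytes_utf8)
    stopifnot(length(diff_at) == 1L)
    print(bytes_plain[diff_at])
    print(bytes_utf8[diff_at])

    # Assumption: the differing byte is the third byte of a 4-byte
    # big-endian flags word, as in "00 00 80 09" above.
    word_bytes <- as.integer(bytes_utf8[(diff_at - 2):(diff_at + 1)])
    flags <- as.integer(sum(word_bytes * 256^(3:0)))

    # Assumption: low 8 bits hold the object type, bits from 12 up hold
    # the per-object flag bits that carry the encoding mark.
    sexp_type <- bitwAnd(flags, 0xFFL)
    levels_field <- bitwShiftR(flags, 12L)
    print(sexp_type)
    print(levels_field)
    ```
    
    If that reading is right, the extra 80 is the encoding mark itself being written into the stream, so the receiving side has to accept a flags word with that bit set.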




