GitHub user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/7494#issuecomment-130662126
  
    @CHOIJAEHONG1, @shivaram:
    
    1. The R worker can run in any locale, because R recognizes UTF-8 and 
preserves the UTF-8 encoding when manipulating strings. The root cause of the 
issue is that writeBin behaves differently when writing a string in 
different locales.
       For example, with x <- "\uac00",
      in a UTF-8 locale, the result of writeBin(x, con) is:
            ea b0 80 00
      while in the C locale, the result of writeBin(x, con) is:
            <U+AC00>
      I suspect this is a bug in writeBin, because it is expected to write the 
internal representation of the string rather than its display 
representation.
     
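    We can fix this issue by changing writeString to:
    writeString <- function(con, value) {
      # Convert to UTF-8 first so the bytes do not depend on the locale
      utfVal <- enc2utf8(value)
      rawString <- charToRaw(utfVal)
      # Write the byte length, then the raw UTF-8 bytes
      writeInt(con, as.integer(length(rawString)))
      writeBin(rawString, con, endian = "big")
    }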
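
    As a quick check of the proposed writeString, here is a minimal sketch 
that writes into an in-memory raw connection. The writeInt below is an 
assumed stand-in (a 4-byte big-endian integer writer), not necessarily 
SparkR's actual helper, and the rawConnection setup is only for illustration.

```r
# Assumed stand-in for the serializer's writeInt: 4-byte big-endian integer.
writeInt <- function(con, value) {
  writeBin(as.integer(value), con, endian = "big")
}

# The writeString proposed above.
writeString <- function(con, value) {
  utfVal <- enc2utf8(value)
  rawString <- charToRaw(utfVal)
  writeInt(con, as.integer(length(rawString)))
  writeBin(rawString, con, endian = "big")
}

# Write "\uac00" into an in-memory connection and inspect the bytes.
con <- rawConnection(raw(0), open = "wb")
writeString(con, "\uac00")
bytes <- rawConnectionValue(con)
close(con)
print(bytes)  # expected bytes: 00 00 00 03 ea b0 80

# For comparison: writing the string directly; per the comment above, this
# output depends on the session locale.
print(writeBin("\uac00", raw()))
```

    Because charToRaw works on the stored bytes of the UTF-8 string, the 
payload should be the same in a UTF-8 locale and in the C locale.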
    
    2. I prefer not to change the current locale in the test case, as Windows 
does not support UTF-8 locales. We can use Unicode escapes in the Unicode 
strings instead, so I guess we don't need to change the locale. R supports 
escaping Unicode characters, as described in 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
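
    To illustrate the escape approach, a test string could be written as 
below. The string itself is a hypothetical example, not one taken from the 
actual test suite.

```r
# Hypothetical test string written with \u escapes, so the source file stays
# ASCII and the test does not need to switch locales.
s <- "\uc548\ub155"

# The stored bytes are the UTF-8 encoding of the two code points.
utf8Bytes <- charToRaw(s)
print(utf8Bytes)  # ec 95 88 eb 85 95

# The byte count does not depend on how the string is displayed.
print(nchar(s, type = "bytes"))  # 6
```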
    
    3. It is not necessary to use iconv. Encoding(x) <- "UTF-8" in the 
deserializer and enc2utf8 in the serializer are enough.
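
    On the deserializer side, the idea could look roughly like the sketch 
below. readInt and readString here are assumed stand-ins that read the 
length-prefixed format produced by the writeString in item 1; they are not 
copied from SparkR's deserializer.

```r
# Assumed stand-in: read a 4-byte big-endian integer.
readInt <- function(con) {
  readBin(con, integer(), n = 1, endian = "big")
}

# Read a length-prefixed string and mark it as UTF-8 instead of using iconv.
readString <- function(con) {
  len <- readInt(con)
  rawString <- readBin(con, raw(), n = len)
  value <- rawToChar(rawString)
  Encoding(value) <- "UTF-8"
  value
}

# Payload: length 3, followed by the UTF-8 bytes of "\uac00".
payload <- as.raw(c(0x00, 0x00, 0x00, 0x03, 0xea, 0xb0, 0x80))
con <- rawConnection(payload, open = "rb")
s <- readString(con)
close(con)

print(Encoding(s))   # "UTF-8"
print(charToRaw(s))  # ea b0 80
```

    Marking the encoding only declares how the existing bytes should be 
interpreted; it does not convert them, which is why iconv is not needed when 
the sender always writes UTF-8.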


