[ 
https://issues.apache.org/jira/browse/SPARK-30645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30645:
------------------------------------

    Assignee: Maciej Szymkiewicz

> collect() support Unicode characters test fails on Windows
> ----------------------------------------------------------
>
>                 Key: SPARK-30645
>                 URL: https://issues.apache.org/jira/browse/SPARK-30645
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR, Tests
>    Affects Versions: 3.0.0
>            Reporter: Maciej Szymkiewicz
>            Assignee: Maciej Szymkiewicz
>            Priority: Major
>
> As-is, the [test_that("collect() support Unicode characters")|https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_sparkSQL.R#L850-L869]
> case seems to be system dependent and doesn't work properly on Windows with
> the CP1252 English locale:
>  
> {code:r}
> library(SparkR)
> SparkR::sparkR.session()
> Sys.info()
> #           sysname           release           version 
> #         "Windows"      "Server x64"     "build 17763" 
> #          nodename           machine             login 
> # "WIN-5BLT6Q610KH"          "x86-64"   "Administrator" 
> #              user    effective_user 
> #   "Administrator"   "Administrator" 
> Sys.getlocale()
> # [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> lines <- c("{\"name\":\"안녕하세요\"}",
>            "{\"name\":\"您好\", \"age\":30}",
>            "{\"name\":\"こんにちは\", \"age\":19}",
>            "{\"name\":\"Xin chào\"}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(lines, jsonPath)
> system(paste0("cat ", jsonPath))
> # {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
> # {"name":"<U+60A8><U+597D>", "age":30}
> # {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
> # {"name":"Xin chào"}
> # [1] 0
> df <- read.df(jsonPath, "json")
> printSchema(df)
> # root
> #  |-- _corrupt_record: string (nullable = true)
> #  |-- age: long (nullable = true)
> #  |-- name: string (nullable = true)
> head(df)
> #              _corrupt_record age                                     name
> # 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
> # 2                       <NA>  30                         <U+60A8><U+597D>
> # 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
> # 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>
> {code}
> The problem becomes visible on AppVeyor when testthat is updated to 2.x, but
> is somehow silenced when testthat 1.x is used.
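
One plausible reading of the `_corrupt_record` row, offered as a sketch and not confirmed in the report: in a CP1252 locale the "à" in "Xin chào" is written as the single byte 0xE0, which is not valid UTF-8, so a reader that decodes the file as UTF-8 would replace it with U+FFFD. The snippet below only compares the two byte sequences. The variable names are illustrative, and the assumption that the JSON reader decodes as UTF-8 is ours.

```r
# "Xin chào" built with a \u escape so the example does not depend on
# the encoding of this script or of the current locale.
x <- enc2utf8("Xin ch\u00e0o")

# Bytes as a CP1252 writer would emit them: "à" becomes the single byte 0xE0.
cp1252_bytes <- iconv(x, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]

# Bytes as a UTF-8 writer would emit them: "à" becomes 0xC3 0xA0.
utf8_bytes <- charToRaw(x)

print(cp1252_bytes)
print(utf8_bytes)

# A lone 0xE0 followed by an ASCII byte is not a valid UTF-8 sequence,
# so a reader that expects UTF-8 cannot decode the CP1252 bytes.
cp1252_is_valid_utf8 <- validUTF8(rawToChar(cp1252_bytes))
utf8_is_valid_utf8 <- validUTF8(rawToChar(utf8_bytes))
print(c(cp1252 = cp1252_is_valid_utf8, utf8 = utf8_is_valid_utf8))
```

If this reading holds, the other three rows parse because the `<U+XXXX>` substitutions are plain ASCII, which is why they come back as literal escape text and not as the original characters.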



