[ 
https://issues.apache.org/jira/browse/ARROW-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138739#comment-17138739
 ] 

Neal Richardson commented on ARROW-7018:
----------------------------------------

Here's a more minimal reproducer. The mangling seems to happen when 
converting Arrow to R, without involving Parquet. I haven't set the 
Icelandic locale; this is on the default English (latin1) locale:

{code}
> x <- "Veitingastaðir"
> x
[1] "Veitingastaðir"
> Encoding(x)
[1] "latin1"
> a <- Array$create(x)
> a
Array
<string>
[
  "Veitingastaðir"
]
> as.vector(a)
[1] "Veitingasta\xf0ir"
> identical(as.vector(a), x)
[1] FALSE
{code}
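
A base-R sketch of what may be happening at the byte level, without arrow at all. The assumption (not confirmed here) is that the latin1 bytes go into the {{utf8}} array unchanged and come back labelled as something other than latin1. The string is built from raw bytes so the example does not depend on the session locale:

{code}
# "Veitingastaðir" in latin1: ð is the single byte 0xf0
x <- rawToChar(c(charToRaw("Veitingasta"), as.raw(0xf0), charToRaw("ir")))
Encoding(x) <- "latin1"

# Same bytes, but declared as UTF-8 (assumed to mimic the round trip)
y <- x
Encoding(y) <- "UTF-8"
identical(charToRaw(x), charToRaw(y))  # TRUE: nothing was re-encoded
validUTF8(y)                           # FALSE: a lone 0xf0 is not valid UTF-8

# A real conversion changes the bytes: ð becomes the two bytes c3 b0
u <- enc2utf8(x)
charToRaw(u)
validUTF8(u)                           # TRUE
{code}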

Read/write parquet doesn't seem to do any special mangling:

{code}
> f <- tempfile()
> write_parquet(Table$create(a=a), f)
> tab <- read_parquet(f, as_data_frame = FALSE)
> tab
Table
1 rows x 1 columns
$a <string>
> tab$a
ChunkedArray
<string>
[
  "Veitingastaðir"
]
> as.data.frame(tab)
                  a
1 Veitingasta<f0>ir
> as.data.frame(tab)$a
[1] "Veitingasta\xf0ir"
{code}

{{enc2native}} doesn't repair this but {{iconv}} does:

{code}
> enc2native(as.vector(a))
[1] "Veitingasta<f0>ir"
> iconv(as.vector(a), to="latin1")
[1] "Veitingastaðir"
{code}
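
The {{iconv}} call above leaves {{from}} at its default, which means the session's native encoding, so it presumably only works here because the session is latin1. A sketch of a locale-independent variant that names the source encoding explicitly, assuming the returned bytes really are latin1 (the {{mangled}} value below is built by hand to stand in for {{as.vector(a)}}):

{code}
# Stand-in for as.vector(a): latin1 bytes with no latin1 marking
mangled <- rawToChar(c(charToRaw("Veitingasta"), as.raw(0xf0), charToRaw("ir")))

# Name the source encoding instead of relying on the native one
repaired <- iconv(mangled, from = "latin1", to = "UTF-8")

Encoding(repaired)   # "UTF-8"
validUTF8(repaired)  # TRUE
nchar(repaired)      # 14 characters, ð counted once
{code}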

Given that the Arrow string array is technically called {{utf8}}, maybe we 
should always convert to UTF-8 when sending strings to Arrow? This does work as 
expected:

{code}
> a2 <- Array$create(enc2utf8(x))
> as.vector(a2)
[1] "Veitingastaðir"
> Encoding(as.vector(a2))
[1] "UTF-8"
{code}

That has the side effect that the C++ pretty printing now clashes with the 
latin1 locale:

{code}
> a2
Array
<string>
[
  "Veitingastaðir"
]
{code}
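
Until something like that happens inside the package, the {{enc2utf8}} conversion could be applied on the user side before handing data to Arrow. A minimal sketch; {{utf8_columns}} is a hypothetical helper, not part of arrow, and it only touches character columns (factor levels are left alone):

{code}
# Hypothetical helper: re-encode every character column of a data frame as UTF-8
utf8_columns <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

# latin1 test string built from raw bytes (ð is 0xf0)
x <- rawToChar(c(charToRaw("Veitingasta"), as.raw(0xf0), charToRaw("ir")))
Encoding(x) <- "latin1"

df <- data.frame(a = x, stringsAsFactors = FALSE)
df <- utf8_columns(df)
Encoding(df$a)  # "UTF-8"
{code}

The result could then be passed to {{write_parquet()}} as usual.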

Thoughts [~wesm] [~apitrou]?

> [R] Special characters as question mark in parquet files
> --------------------------------------------------------
>
>                 Key: ARROW-7018
>                 URL: https://issues.apache.org/jira/browse/ARROW-7018
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.15.0
>         Environment: I'm running R on Windows 10
>            Reporter: Vidar Ingason
>            Assignee: Romain Francois
>            Priority: Critical
>             Fix For: 1.0.0
>
>
> Hello.
> I'm new to the arrow package in R and I'm having trouble with special 
> characters (Icelandic). I have a large data set and everything is fine until 
> I write the file to disk and read it in again (i.e. I use write_parquet() and 
> then read_parquet()). When I read the data back into R, special characters 
> turn into question marks, e.g. Veitingastaðir becomes Veitingasta�ir.
> This does not happen when I use .csv.
> Is there anything I can do when I write the .parquet file to disk or when I 
> read it in to prevent this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
