[ 
https://issues.apache.org/jira/browse/ARROW-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138787#comment-17138787
 ] 

Neal Richardson commented on ARROW-7018:
----------------------------------------

[~apitrou] Yes, we're talking about Windows.

I tagged you less for your opinion about what R users should do and more for 
your knowledge of the Arrow format and the assumptions other parts of the 
project make. Specifically: how do we handle non-UTF-8 string data? Is it 
forbidden to put non-UTF-8 data in a string Array, or is it merely discouraged 
rather than explicitly illegal?
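
To make the question concrete, here is a minimal sketch in plain Python 
(standard library only, no Arrow calls) of what "non-UTF-8 data in a string 
Array" means at the byte level. It assumes the Windows native encoding in the 
report is Windows-1252 (cp1252), which the thread does not actually state; 
is_valid_utf8 is a hypothetical helper, not an Arrow API.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "Veitingastaðir"

# Assumption: the R session on Windows held the string in cp1252.
native_bytes = word.encode("cp1252")
utf8_bytes = word.encode("utf-8")

utf8_ok = is_valid_utf8(utf8_bytes)      # well-formed UTF-8
native_ok = is_valid_utf8(native_bytes)  # the lone 0xF0 byte is not valid UTF-8

# A lenient reader that substitutes U+FFFD for bad bytes reproduces the
# symptom from the report: the "ð" is lost.
garbled = native_bytes.decode("utf-8", errors="replace")
```

Under that assumption, the bytes are stored without complaint, but any 
consumer that treats them as UTF-8 either rejects them or silently replaces 
the offending byte.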

I can't think of how exactly I would indicate that this StringArray has a 
non-UTF-8 encoding. As part of a RecordBatch or Table, I could attach some 
custom metadata, but nothing in Arrow would know what to do with it, for 
better or worse.

I'm OK if the policy is that Arrow strings are always UTF-8; I can work with 
that. But it's not clear that the policy is enforced, that it actually holds 
in practice, or that all Arrow libraries share it. The C++ pretty-print 
methods, at least, seem to rely on the system locale, but maybe that's 
something to be addressed.
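
If the policy is "always UTF-8", one way to work with it is to transcode at 
the boundary, before the bytes are handed to Arrow. The following is only a 
sketch in plain Python under the same cp1252 assumption as before; to_utf8 is 
a hypothetical helper, and the actual fix in the R package may look quite 
different.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode `data` from `source_encoding` to UTF-8.

    Raises UnicodeDecodeError if `data` is not valid in `source_encoding`,
    so bad input fails loudly instead of being stored as-is.
    """
    return data.decode(source_encoding).encode("utf-8")


# Assumption: the native encoding is cp1252; a real caller would need to
# look this up from the session or locale rather than hard-code it.
native_bytes = "Veitingastaðir".encode("cp1252")
normalized = to_utf8(native_bytes, "cp1252")

# The normalized bytes now round-trip through a strict UTF-8 decoder.
round_tripped = normalized.decode("utf-8")
```

The design choice here is to pay the conversion cost once on the way in, so 
that every downstream reader can assume UTF-8 without consulting metadata.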


> [R] Special characters as question mark in parquet files
> --------------------------------------------------------
>
>                 Key: ARROW-7018
>                 URL: https://issues.apache.org/jira/browse/ARROW-7018
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.15.0
>         Environment: I'm running R on Windows 10
>            Reporter: Vidar Ingason
>            Assignee: Romain Francois
>            Priority: Critical
>             Fix For: 1.0.0
>
>
> Hello.
> I'm new to the arrow package in R and I'm having trouble with special 
> characters (Icelandic). I have a large data set and everything is fine until 
> I write the file to disk and read it in again (i.e. I use write_parquet() and 
> then read_parquet()). When I read the data back into R, special characters 
> turn into question marks, e.g. Veitingastaðir becomes Veitingasta�ir.
> This does not happen when I use .csv.
> Is there anything I can do when I write the .parquet file to disk or when I 
> read it in to prevent this?



