[
https://issues.apache.org/jira/browse/ARROW-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138787#comment-17138787
]
Neal Richardson commented on ARROW-7018:
----------------------------------------
[~apitrou] Yes, we're talking about Windows.
I tagged you less for your opinion about what R users should do and more for
your knowledge of the Arrow format and what assumptions other parts of the
project make. Specifically, how do we handle non-UTF-8 string data? Is it
forbidden to put non-UTF-8 data in a string Array, or is it not explicitly
illegal, just discouraged?
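To make the question concrete, here is a minimal sketch in plain Python (not R, and not using any Arrow API) of what "non-UTF-8 data in a string" means at the byte level. The helper name `is_valid_utf8` is hypothetical, and the choice of Latin-1 as the non-UTF-8 encoding is an assumption made only for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "Veitingastaðir"
utf8_bytes = word.encode("utf-8")      # ð is stored as the two bytes 0xC3 0xB0
latin1_bytes = word.encode("latin-1")  # ð is stored as the single byte 0xF0

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False

# A reader that assumes UTF-8 and substitutes on errors would show a
# replacement character where the ð used to be.
garbled = latin1_bytes.decode("utf-8", errors="replace")
print(garbled)
```

A check like this only says whether the bytes are well-formed UTF-8; it cannot tell which encoding the bytes were actually written in.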
I can't think of how exactly I would indicate that this StringArray has a
non-UTF-8 encoding. As part of a RecordBatch or Table, I could attach some
custom metadata, but nothing in Arrow would know what to do with that, for
better or worse.
I'm OK if the policy is that Arrow strings are always UTF-8; I can work with
that. But it's not clear that the policy is enforced, that it actually holds
in practice, or that all Arrow libraries share it. From the C++ pretty-print
methods, at least, there seems to be some reliance on the system locale, but
maybe that's something to be addressed.
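If the policy is "always UTF-8", the producer would have to transcode natively encoded strings before handing them over. A minimal sketch of that step in plain Python follows. It does not reflect how the R package or any Arrow library does this; the helper name `to_utf8` is hypothetical, and cp1252 is only an assumed example of a Windows native encoding (the issue does not say which one the reporter's system uses).

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode `raw` from `source_encoding` to UTF-8 (hypothetical helper).

    Decoding is strict, so bytes that are invalid in the source encoding
    raise UnicodeDecodeError instead of being silently replaced.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Assumption: the string arrives in cp1252, where ð is the single byte 0xF0.
native = "Veitingastaðir".encode("cp1252")
normalized = to_utf8(native, "cp1252")

print(normalized)                  # UTF-8 bytes, ð is now 0xC3 0xB0
print(normalized.decode("utf-8"))  # round-trips to the original text
```

The source encoding has to be known from somewhere else (locale, a declared encoding on the string, or user input), because the bytes alone do not identify it.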
> [R] Special characters as question mark in parquet files
> --------------------------------------------------------
>
> Key: ARROW-7018
> URL: https://issues.apache.org/jira/browse/ARROW-7018
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.0
> Environment: I'm running R on Windows 10
> Reporter: Vidar Ingason
> Assignee: Romain Francois
> Priority: Critical
> Fix For: 1.0.0
>
>
> Hello.
> I'm new to the arrow package in R and I'm having trouble with special
> characters (Icelandic). I have a large data set and everything is fine until
> I write the file to disk and read it in again (i.e. I use write_parquet() and
> then read_parquet()). When I read the data back into R, special characters
> turn into question marks, e.g. Veitingastaðir becomes Veitingasta�ir.
> This does not happen when I use .csv.
> Is there anything I can do when I write the .parquet file to disk or when I
> read it in to prevent this?