[
https://issues.apache.org/jira/browse/ARROW-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962823#comment-16962823
]
Vidar Ingason edited comment on ARROW-7018 at 10/30/19 9:05 AM:
----------------------------------------------------------------
Hi Neal
Here is a small code snippet that reproduces this issue.
{code:java}
library(tidyverse)
library(arrow)
df <- tibble(a = c("Veitingastaðir"),
             b = 10)
write_parquet(df, "test.parquet")
df_read <- read_parquet("test.parquet")
{code}
Session info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=Icelandic_Iceland.1252 LC_CTYPE=Icelandic_Iceland.1252
LC_MONETARY=Icelandic_Iceland.1252
[4] LC_NUMERIC=C LC_TIME=Icelandic_Iceland.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_0.15.0 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3
readr_1.3.1 tidyr_1.0.0 tibble_2.1.3
[9] ggplot2_3.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2 compiler_3.6.1 tools_3.6.1
bit_1.1-14 zeallot_0.1.0
[8] jsonlite_1.6 lubridate_1.7.4 lifecycle_0.1.0 nlme_3.1-140 gtable_0.3.0
lattice_0.20-38 pkgconfig_2.0.3
[15] rlang_0.4.1 cli_1.1.0 rstudioapi_0.10 haven_2.1.1 withr_2.1.2 xml2_1.2.2
httr_1.4.1
[22] generics_0.0.2 vctrs_0.2.0 hms_0.5.1 bit64_0.9-7 grid_3.6.1
tidyselect_0.2.5 glue_1.3.1
[29] R6_2.4.0 fansi_0.4.0 readxl_1.3.1 modelr_0.1.5 magrittr_1.5
backports_1.1.5 scales_1.0.0
[36] rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4 stringi_1.4.3
lazyeval_0.2.2 munsell_0.5.0
[43] broom_0.5.2 crayon_1.3.4
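The locale above is Windows-1252 rather than UTF-8, so one possible explanation (an assumption, not confirmed in this thread) is that the string reaches the Parquet writer in the native encoding. Below is a minimal base-R sketch for checking how a string is encoded and for building a UTF-8 copy. The variable names are illustrative, and whether converting before write_parquet() avoids the problem in arrow 0.15.0 is untested here.

```r
# Sketch only: inspect string encodings with base R (no arrow needed).
# "\u00f0" is the letter "ð", written as an escape so that the encoding
# of this source file does not matter.
x_utf8 <- "Veitingasta\u00f0ir"

# Simulate what a Windows-1252 / Latin-1 session might hold: the same
# text, but stored with one byte per character and marked "latin1".
x_native <- iconv(x_utf8, from = "UTF-8", to = "latin1")

Encoding(x_native)              # encoding mark R attaches to the string
nchar(x_native, type = "bytes") # 14: "ð" is a single byte in Latin-1
nchar(x_utf8, type = "bytes")   # 15: "ð" is two bytes in UTF-8

# enc2utf8() re-encodes a marked string to UTF-8 without changing the text.
x_fixed <- enc2utf8(x_native)
identical(charToRaw(x_fixed), charToRaw(x_utf8))

# Reports whether the current session locale is UTF-8.
l10n_info()[["UTF-8"]]
```

In the reproduction above, the analogous step would be df$a <- enc2utf8(df$a) before the call to write_parquet().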
was (Author: vidaringa):
Hi Neal
Here is a small code snippet that reproduces this issue.
{code:java}
library(tidyverse)
library(arrow)
df <- tibble(a = c("Veitingastaðir"),
b = 10)
write_parquet(df, "test.parquet")
df_read <- read_parquet("test.parquet")
{code}
> Special characters as question mark in parquet files in R
> ---------------------------------------------------------
>
> Key: ARROW-7018
> URL: https://issues.apache.org/jira/browse/ARROW-7018
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.0
> Environment: I'm running R on Windows 10
> Reporter: Vidar Ingason
> Priority: Major
>
> Hello.
> I'm new to the arrow package in R and I'm having trouble with special
> characters (Icelandic). I have a large data set and everything is fine until
> I write the file to disk and read it in again (i.e. I use write_parquet() and
> then read_parquet()). When I read the data back into R, special characters
> turn into question marks, e.g. Veitingastaðir becomes Veitingasta�ir.
> This does not happen when I use .csv.
> Is there anything I can do when I write the .parquet file to disk, or when I
> read it in, to prevent this?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)