[
https://issues.apache.org/jira/browse/ARROW-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384873#comment-17384873
]
Daniel Paierl edited comment on ARROW-13421 at 7/21/21, 12:54 PM:
------------------------------------------------------------------
Hi [~thisisnic], thanks for the super fast reply! Sorry I forgot the reprex;
it's easy to forget how insular these "," vs. "." problems are when European
decimal commas and thousands separators are standard here. Sadly, I cannot
change the format of the source data; even using .parquet files is a major
departure from what has been done in the past.
Without further ado:
h2. Reprex
{code:r}
set.seed(1)
tbl <- tibble::tibble(x = rnorm(5))
tbl
#> # A tibble: 5 x 1
#> x
#> <dbl>
#> 1 -0.626
#> 2 0.184
#> 3 -0.836
#> 4 1.60
#> 5 0.330
# write to file in European format (separator = ";", decimal mark = ",")
readr::write_csv2(tbl, "arrow_repex.csv")
# read in with delim set to ";"
arrow::read_delim_arrow(file = "arrow_repex.csv", delim = ";")
#> # A tibble: 5 x 1
#> x
#> <chr>
#> 1 -0,626453810742332
#> 2 0,183643324222082
#> 3 -0,835628612410047
#> 4 1,595280802137792
#> 5 0,329507771815361
# works with data.table::fread with sep = ";" and dec = ","
data.table::fread("arrow_repex.csv", sep = ";", dec = ",")
#> x
#> 1: -0.6264538
#> 2: 0.1836433
#> 3: -0.8356286
#> 4: 1.5952808
#> 5: 0.3295078
{code}
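Until such an option exists, the post-hoc conversion can be done by hand. The following is only a minimal sketch in base R: {{parse_decimal_comma}} is a hypothetical helper name (not part of arrow), and it assumes the strings contain a decimal comma and no thousands separators.

```r
# Hypothetical helper (not part of arrow): convert strings that use a
# decimal comma, e.g. "-0,626", to doubles.
# Assumption: no thousands separators are present in the input.
parse_decimal_comma <- function(x) {
  # Replace the first comma in each string with a point, then parse.
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

# A few of the strings returned by read_delim_arrow() in the reprex
x_chr <- c("-0,626453810742332", "0,183643324222082", "1,595280802137792")
x_dbl <- parse_decimal_comma(x_chr)

# The result is a double vector rather than character
is.double(x_dbl)
```

Applied to the table from the reprex, this would be something like {{df$x <- parse_decimal_comma(df$x)}}, where {{df}} stands for the result of {{read_delim_arrow()}}. It costs an extra pass over every affected column, which is what a built-in decimal marker option would avoid.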
h3. Session Info
{code:r}
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_4.0.1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6
[5] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.1
[9] ggplot2_3.3.3 tidyverse_1.3.1
{code}
Edit: updated reprex
was (Author: ruser):
Hi [~thisisnic], thanks for the super fast reply! Sorry I forgot the reprex;
it's easy to forget how insular these "," vs. "." problems are when European
decimal commas and thousands separators are standard here. Sadly, I cannot
change the format of the source data; even using .parquet files is a major
departure from what has been done in the past.
Without further ado:
h2. Reprex
{code:r}
set.seed(1)
# random values
tbl <- tibble::tibble(x = rnorm(5))
tbl
## # A tibble: 5 x 1
## x
## <dbl>
## 1 -0.626
## 2 0.184
## 3 -0.836
## 4 1.60
## 5 0.330
# write to file in European format (separator = ";", decimal mark = ",")
readr::write_csv2(tbl, here::here("01_proc_data/arrow_repex.csv"))
# read in with delim set to ";"
arrow::read_delim_arrow(file = here::here("01_proc_data/arrow_repex.csv"),
delim = ";")
## # A tibble: 5 x 1
## x
## <chr>
## 1 -0,626453810742332
## 2 0,183643324222082
## 3 -0,835628612410047
## 4 1,595280802137792
## 5 0,329507771815361
{code}
h3. Session Info
{code:r}
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_4.0.1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6
[5] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.1
[9] ggplot2_3.3.3 tidyverse_1.3.1
{code}
> [R] Add choice for decimal marker in read_delim_arrow
> -----------------------------------------------------
>
> Key: ARROW-13421
> URL: https://issues.apache.org/jira/browse/ARROW-13421
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 4.0.1
> Reporter: Daniel Paierl
> Priority: Minor
> Labels: R
>
> In the R arrow package, read_delim_arrow lacks an option to specify the
> decimal marker (e.g. comma or point) in the parsing options.
> This is a major inconvenience for data with a _comma_ as the decimal marker
> (European users), since the data is read in as a string, which requires
> post-hoc conversion of the string to double.
>
> Request: Add a parsing option to set the decimal marker if that is possible.