Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

via GitHub Fri, 06 Oct 2023 10:41:10 -0700


paleolimbot commented on code in PR #38002:
URL: https://github.com/apache/arrow/pull/38002#discussion_r1349137559



##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = 
CsvConvertOptions$create(decimal_point = ","),
+                            read_options = NULL,
+                            as_data_frame = TRUE,
+                            timestamp_parsers = NULL) {

Review Comment:
   I imagine these are freshly copy/pasted, but maybe double-check that the 
defaults are correct.



##########
r/tests/testthat/test-csv.R:
##########
@@ -733,3 +733,10 @@ test_that("Can read CSV files from a URL", {
   expect_true(tibble::is_tibble(cu))
   expect_identical(dim(cu), c(100L, 13L))
 })
+
+test_that("read_csv2_arrow correctly parses comma decimals", {
+  tf <- tempfile()

Review Comment:
   ```suggestion
     tf <- tempfile()
     on.exit(unlink(tf))
   ```



##########
r/tests/testthat/test-dataset-csv.R:
##########
@@ -637,3 +637,18 @@ test_that("GH-34640 - CSV datasets are read in correctly 
when both schema and pa
       summarize(mean = mean(integer))
   )
 })
+
+test_that("open_dataset() with `decimal_point` argument", {
+  temp_dir <- make_temp_dir()

Review Comment:
   ```suggestion
     temp_dir <- make_temp_dir()
     on.exit(unlink(temp_dir, recursive = TRUE))
   ```



##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = 
CsvConvertOptions$create(decimal_point = ","),

Review Comment:
   This pattern seems like it might not generalize...if a user was using 
`read_csv_arrow(..., convert_options = ...)`, they can't simply change that to 
`read_csv2_arrow(..., convert_options = ...)` and get the behaviour they were 
expecting. I wonder if adding `decimal_point` as a top-level argument to 
`read_delim_arrow()` would lead to a more predictable relationship between 
`read_csv_arrow()` and `read_csv2_arrow()`?



##########
r/tests/testthat/test-dataset-csv.R:
##########
@@ -637,3 +637,18 @@ test_that("GH-34640 - CSV datasets are read in correctly 
when both schema and pa
       summarize(mean = mean(integer))
   )
 })
+
+test_that("open_dataset() with `decimal_point` argument", {
+  temp_dir <- make_temp_dir()
+  writeLines("x\ty\n1,2\tc", con = file.path(temp_dir, "file1.csv"))
+
+  expect_equal(
+    open_dataset(temp_dir, format = "tsv") %>% collect(),
+    tibble(x = "1,2", y = "c")
+  )
+
+  expect_equal(
+    open_dataset(temp_dir, format = "tsv", decimal_point = ",") %>% collect(),
+    tibble(x = 1.2, y = "c")
+  )

Review Comment:
   What happens if you try to pass a `decimal_point = "more than one 
character"` or `decimal_point = ""`? (Not sure if it needs a test, but worth 
checking that it doesn't crash)



##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = 
CsvConvertOptions$create(decimal_point = ","),

Review Comment:
   (Or simply documenting that you can add that to `convert_options` and leave 
the tidyverse-interface-copying for another time)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Reply via email to