thisisnic opened a new issue, #34640:
URL: https://github.com/apache/arrow/issues/34640
### Describe the bug, including details regarding any error messages, version, and platform.
Surfaced via R:
``` r
library(dplyr)
library(arrow)
# set up temporary directory
tf <- tempfile()
dir.create(tf)
# set up dummy dataset
df <- tibble::tibble(group = rep(1:2, each = 5), value = 1:10)
df
#> # A tibble: 10 × 2
#>    group value
#>    <int> <int>
#>  1     1     1
#>  2     1     2
#>  3     1     3
#>  4     1     4
#>  5     1     5
#>  6     2     6
#>  7     2     7
#>  8     2     8
#>  9     2     9
#> 10     2    10
# write dataset
write_dataset(df, tf, format = "csv", partitioning = "group", hive_style = FALSE)
list.files(tf, recursive = TRUE)
#> [1] "1/part-0.csv" "2/part-0.csv"
# with just partitioning, successfully can read back in partitioning variable
open_dataset(
  tf,
  format = "csv",
  partitioning = "group"
) %>% collect()
#> # A tibble: 10 × 2
#>    value group
#>    <int> <int>
#>  1     1     1
#>  2     2     1
#>  3     3     1
#>  4     4     1
#>  5     5     1
#>  6     6     2
#>  7     7     2
#>  8     8     2
#>  9     9     2
#> 10    10     2
# with partitioning and schema supplied, "group" variable is not included
open_dataset(
  tf,
  format = "csv",
  schema = schema(value = int32()),
  skip = 1,
  partitioning = schema(group = int32())
) %>% collect()
#> # A tibble: 10 × 1
#>    value
#>    <int>
#>  1     1
#>  2     2
#>  3     3
#>  4     4
#>  5     5
#>  6     6
#>  7     7
#>  8     8
#>  9     9
#> 10    10
# we can't add the partitioning variable to the schema or we get an error
open_dataset(
  tf,
  format = "csv",
  schema = schema(value = int32(), group = int32()),
  skip = 1,
  partitioning = schema(group = int32())
) %>% collect()
#> Error in `compute.Dataset()` at r/R/dplyr-collect.R:33:2:
#> ! Invalid: Could not open CSV input source '/tmp/RtmpUTGEHf/file492ad7322363a/1/part-0.csv': Invalid: CSV parse error: Row #2: Expected 2 columns, got 1: 1
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:477  (ParseLine<SpecializedOptions, false>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:566  ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)
#> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:426  parser->ParseFinal(views, &parsed_size)
#> Backtrace:
#>      ▆
#>   1. ├─... %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.Dataset(.)
#>   4.   ├─arrow:::collect.ArrowTabular(compute.Dataset(x), as_data_frame) at r/R/dplyr-collect.R:33:2
#>   5.   │ └─base::as.data.frame(x, ...) at r/R/dplyr-collect.R:27:4
#>   6.   └─arrow:::compute.Dataset(x) at r/R/dplyr-collect.R:33:2
#>   7.     └─base::tryCatch(...) at r/R/dplyr-collect.R:40:2
#>   8.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   9.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  10.           └─value[[3L]](cond)
#>  11.             └─arrow:::augment_io_error_msg(e, call, schema = schema()) at r/R/dplyr-collect.R:49:6
#>  12.               └─rlang::abort(msg, call = call) at r/R/util.R:251:2
```
This was discussed in #32938, but the solution mentioned there works for Parquet files and not for CSV.
### Component(s)
C++