thisisnic opened a new issue, #34640:
URL: https://github.com/apache/arrow/issues/34640
### Describe the bug, including details regarding any error messages, version, and platform.
Surfaced via R:
``` r
library(dplyr)
library(arrow)
# set up temporary directory
tf <- tempfile()
dir.create(tf)
# set up dummy dataset
df <- tibble::tibble(group = rep(1:2, each = 5), value = 1:10)
df
#> # A tibble: 10 × 2
#>    group value
#>    <int> <int>
#>  1     1     1
#>  2     1     2
#>  3     1     3
#>  4     1     4
#>  5     1     5
#>  6     2     6
#>  7     2     7
#>  8     2     8
#>  9     2     9
#> 10     2    10
# write dataset
write_dataset(df, tf, format = "csv", partitioning = "group", hive_style = FALSE)
list.files(tf, recursive = TRUE)
#> [1] "1/part-0.csv" "2/part-0.csv"
# with just partitioning, successfully can read back in partitioning variable
open_dataset(
  tf,
  format = "csv",
  partitioning = "group"
) %>% collect()
#> # A tibble: 10 × 2
#>    value group
#>    <int> <int>
#>  1     1     1
#>  2     2     1
#>  3     3     1
#>  4     4     1
#>  5     5     1
#>  6     6     2
#>  7     7     2
#>  8     8     2
#>  9     9     2
#> 10    10     2
# with partitioning and schema supplied, "group" variable is not included
open_dataset(
  tf,
  format = "csv",
  schema = schema(value = int32()),
  skip = 1,
  partitioning = schema(group = int32())
) %>% collect()
#> # A tibble: 10 × 1
#>    value
#>    <int>
#>  1     1
#>  2     2
#>  3     3
#>  4     4
#>  5     5
#>  6     6
#>  7     7
#>  8     8
#>  9     9
#> 10    10
# we can't add the partitioning variable to the schema or we get an error
open_dataset(
  tf,
  format = "csv",
  schema = schema(value = int32(), group = int32()),
  skip = 1,
  partitioning = schema(group = int32())
) %>% collect()
#> Error in `compute.Dataset()` at r/R/dplyr-collect.R:33:2:
#> ! Invalid: Could not open CSV input source '/tmp/RtmpUTGEHf/file492ad7322363a/1/part-0.csv': Invalid: CSV parse error: Row #2: Expected 2 columns, got 1: 1
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:477  (ParseLine<SpecializedOptions, false>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:566  ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)
#> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:426  parser->ParseFinal(views, &parsed_size)
#> Backtrace:
#>      ▆
#>   1. ├─... %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.Dataset(.)
#>   4.   ├─arrow:::collect.ArrowTabular(compute.Dataset(x), as_data_frame) at r/R/dplyr-collect.R:33:2
#>   5.   │ └─base::as.data.frame(x, ...) at r/R/dplyr-collect.R:27:4
#>   6.   └─arrow:::compute.Dataset(x) at r/R/dplyr-collect.R:33:2
#>   7.     └─base::tryCatch(...) at r/R/dplyr-collect.R:40:2
#>   8.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   9.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  10.           └─value[[3L]](cond)
#>  11.             └─arrow:::augment_io_error_msg(e, call, schema = schema()) at r/R/dplyr-collect.R:49:6
#>  12.               └─rlang::abort(msg, call = call) at r/R/util.R:251:2
```
This was discussed in #32938, but the solution mentioned there works for Parquet files and not for CSV.
### Component(s)
C++