[
https://issues.apache.org/jira/browse/ARROW-15252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469199#comment-17469199
]
Nicola Crane commented on ARROW-15252:
--------------------------------------
Thanks for opening this issue [~martindut]. I think the problem here is that
the CSV reader isn't expecting the footer row and treats it as data. The
footer has only 3 fields where the reader expects 41, hence the "Expected 41
columns, got 3" error. The C++ code includes the ability to skip footer rows,
but this isn't exposed at the R level (yet).
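
Until footer skipping is exposed in R, one possible workaround is to strip the
footer before the files reach the CSV reader. The following is only a sketch
under stated assumptions: drop_footer_rows is a hypothetical base R helper (not
part of arrow), and it assumes each file ends with a fixed number of footer
lines, such as the single "T,7," row shown in the error.

```r
# Hypothetical helper (not part of the arrow package): copy a CSV file,
# leaving out its last `n_footer` lines. Header lines are left untouched,
# so the original `skip` setting still applies to the cleaned copy.
drop_footer_rows <- function(path, out_path, n_footer = 1) {
  lines <- readLines(path, warn = FALSE)
  # Never ask for a negative number of lines if the file is very short.
  n_keep <- max(length(lines) - n_footer, 0)
  writeLines(lines[seq_len(n_keep)], out_path)
  invisible(out_path)
}

# Example use (directory names are placeholders): clean every file in one
# directory into a second directory.
# files <- list.files("raw_dir", pattern = "\\.csv$", full.names = TRUE)
# dir.create("clean_dir", showWarnings = FALSE)
# for (f in files) drop_footer_rows(f, file.path("clean_dir", basename(f)))
```

The cleaned directory could then be passed to open_dataset with the same
format and read options as in the report. Note that this flat loop does not
recreate partition subdirectories, so it would need adapting if the
partitioning is encoded in the directory layout.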
> [R] open_dataset - csv file with header and footer
> --------------------------------------------------
>
> Key: ARROW-15252
> URL: https://issues.apache.org/jira/browse/ARROW-15252
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Martin du Toit
> Priority: Major
> Attachments: I2478172_Activity_20180830.csv
>
>
> Not sure if this is a bug. I call open_dataset on a directory containing
> CSV files with a header and a footer, and specify the convert options below,
> including include_missing_columns. The same code works fine on files with no
> header and footer.
> {code:r}
> # col_names is known in advance
> col_names <- c("col names specified as in 2nd row of file")
> skip <- 2
> file_path <- "path to directory holding various files"
> #schema_file <- created using arrow::schema
> #schema_df <- created using arrow::schema but with extra columns for the
> # .partition_cols
> conv_options <- CsvConvertOptions$create(strings_can_be_null = TRUE,
> include_missing_columns = TRUE, include_columns = col_names)
> read_options <- arrow:::readr_to_csv_read_options(skip, col_names)
> format <- arrow::FileFormat$create(format = "text", schema = schema_file,
> convert_options = conv_options, read_options = read_options)
> ds <- arrow::open_dataset(sources = file_path, schema = schema_df,
> partitioning = .partition_cols, format = format){code}
> The dataset gets created, but any further operation on the dataset fails with
> {code:r}
> Error: Invalid: CSV parse error: Row #7: Expected 41 columns, got 3: T,7,
> {code}