Martin du Toit created ARROW-15252:
--------------------------------------

             Summary: [R] open_dataset - csv file with header and footer
                 Key: ARROW-15252
                 URL: https://issues.apache.org/jira/browse/ARROW-15252
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Martin du Toit
         Attachments: I2478172_Activity_20180830.csv

Not sure if this is a bug. I call open_dataset on a directory containing csv 
files that each have a header and a footer, and I specify the convert options 
below, including include_missing_columns. The same code works fine on files 
with no header and footer.
{code:r}
library(arrow)

# i.e. the column names are known
col_names <- c("col names specified as in 2nd row of file")
skip <- 2
file_path <- "path to directory holding various files"

# schema_file <- created using arrow::schema
# schema_df <- created using arrow::schema but with extra columns for the
#              .partition_cols

conv_options <- CsvConvertOptions$create(
  strings_can_be_null = TRUE,
  include_missing_columns = TRUE,
  include_columns = col_names
)

read_options <- arrow:::readr_to_csv_read_options(skip, col_names)

format <- arrow::FileFormat$create(
  format = "text",
  schema = schema_file,
  convert_options = conv_options,
  read_options = read_options
)
ds <- arrow::open_dataset(
  sources = file_path,
  schema = schema_df,
  partitioning = .partition_cols,
  format = format
)
{code}
The dataset gets created, but any further operation on the dataset fails with
{code:r}
Error: Invalid: CSV parse error: Row #7: Expected 41 columns, got 3: T,7,
{code}
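
The error points at a row with 3 fields where 41 are expected, which looks like the footer row being parsed as data. A minimal sketch of one possible workaround follows, assuming each file has a fixed number of header and footer lines: copy the files with those lines removed, so that they match the no-header, no-footer case that already works. The helper name strip_header_footer, its arguments, and the sample rows are hypothetical and not part of arrow; it uses base R only.

```r
# Hypothetical helper (not part of arrow): copy a CSV file while dropping
# a fixed number of leading header lines and trailing footer lines.
strip_header_footer <- function(in_path, out_path, n_header = 2, n_footer = 1) {
  lines <- readLines(in_path, warn = FALSE)
  n <- length(lines)
  if (n < n_header + n_footer) {
    stop("File has fewer lines than n_header + n_footer: ", in_path)
  }
  # Indices of the data rows between the header and the footer.
  keep <- seq_len(n - n_header - n_footer) + n_header
  writeLines(lines[keep], out_path)
  invisible(out_path)
}

# Made-up sample: two header lines, two data rows, one short footer row.
tmp_in <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
writeLines(c("H,20180830", "a,b,c,d", "1,2,3,4", "5,6,7,8", "T,2,"), tmp_in)

# Field counts per line make the rows that differ from the data rows visible.
print(count.fields(tmp_in, sep = ","))

strip_header_footer(tmp_in, tmp_out, n_header = 2, n_footer = 1)
print(readLines(tmp_out))
```

This rewrites whole files, so it only suits data small enough to copy. The cleaned copies need no skip value, because the header lines are already gone.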



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
