[jira] [Commented] (ARROW-15252) [R] Expose skip_rows_after in CSVReadOptions

Martin du Toit (Jira) Wed, 05 Jan 2022 04:25:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469265#comment-17469265
 ]


Martin du Toit commented on ARROW-15252:
----------------------------------------

I actually used reticulate to call the pyarrow functions; I'm not sure whether 
that might be the problem.
{code:r}
pa <- reticulate::import("pyarrow", convert = FALSE)
pyds <- reticulate::import("pyarrow.dataset", convert = FALSE)

# schema_file, schema_df, col_names, delimiter, dl_path, fs and
# .partition_cols are defined elsewhere.
skiphead <- 2L # skip the header row and the column-names row (integer literal, so reticulate passes a Python int)

fs_conv <- pa$csv$ConvertOptions(column_types = schema_file,
  strings_can_be_null = TRUE, include_missing_columns = TRUE,
  include_columns = col_names)
fs_pars <- pa$csv$ParseOptions(delimiter = delimiter, ignore_empty_lines = TRUE)
fs_read <- pa$csv$ReadOptions(skip_rows = skiphead, column_names = col_names)

csv_format <- pyds$CsvFileFormat(read_options = fs_read,
  parse_options = fs_pars, convert_options = fs_conv)

ds <- pyds$dataset(source = dl_path, schema = schema_df, format = csv_format,
  filesystem = fs, partitioning = .partition_cols)

{code}

> [R] Expose skip_rows_after in CSVReadOptions 
> ---------------------------------------------
>
>                 Key: ARROW-15252
>                 URL: https://issues.apache.org/jira/browse/ARROW-15252
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Martin du Toit
>            Assignee: Nicola Crane
>            Priority: Major
>         Attachments: I2478172_Activity_20180830.csv
>
>
> Not sure if this is a bug: I call open_dataset on a directory of CSV files 
> that each have a header and a footer, using the convert options below 
> (including include_missing_columns). The same code works fine on files with 
> no header or footer.
> {code:r}
> col_names <- c("col names specified as in 2nd row of file") # i.e. col_names is known
> skip <- 2
> file_path <- "path to directory holding various files"
> # schema_file <- created using arrow::schema
> # schema_df <- created using arrow::schema but with extra columns for the .partition_cols
> conv_options <- CsvConvertOptions$create(strings_can_be_null = TRUE,
>   include_missing_columns = TRUE, include_columns = col_names)
> read_options <- arrow:::readr_to_csv_read_options(skip, col_names)
> format <- arrow::FileFormat$create(format = "text", schema = schema_file,
>   convert_options = conv_options, read_options = read_options)
> ds <- arrow::open_dataset(sources = file_path, schema = schema_df,
>   partitioning = .partition_cols, format = format)
> {code}
> The dataset gets created, but any further operation on the dataset fails with
> {code:r}
> Error: Invalid: CSV parse error: Row #7: Expected 41 columns, got 3: T,7,
> {code}
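
The parse error quoted above (Expected 41 columns, got 3: T,7,) looks consistent with the footer row reaching the CSV parser: skipping leading rows removes the header, but nothing removes the trailing row. As a rough sketch of one possible workaround, and not something taken from the arrow package, the files could be pre-cleaned in base R before the dataset is opened. strip_header_footer and its arguments n_head and n_foot are hypothetical names, and the five-line demo file is invented for illustration.

```r
# Hypothetical helper (not part of arrow): return the data lines of a
# delimited text file, dropping n_head leading and n_foot trailing lines.
strip_header_footer <- function(path, n_head = 2L, n_foot = 1L) {
  lines <- readLines(path, warn = FALSE)
  n <- length(lines)
  if (n < n_head + n_foot) {
    stop("file has fewer lines than n_head + n_foot")
  }
  # Keep only the lines between the header block and the footer block.
  lines[seq_len(n - n_head - n_foot) + n_head]
}

# Invented demo file: header row, column-names row, two data rows, footer row.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("H,20180830,demo",
    "a,b,c,d",
    "1,2,3,4",
    "5,6,7,8",
    "T,2,"),
  demo_path
)

body <- strip_header_footer(demo_path, n_head = 2L, n_foot = 1L)

# Parse the remaining lines; column names are supplied explicitly because
# the column-names row was dropped together with the header.
df <- read.csv(text = body, header = FALSE,
               col.names = c("a", "b", "c", "d"))
```

The cleaned lines could then be written back out with writeLines() into a separate directory, and that directory opened as the dataset. This reads each file fully into memory, so it may not suit very large files.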



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
