dhicks opened a new issue #12469: URL: https://github.com/apache/arrow/issues/12469
Background: I'm trying to use arrow to (a) assemble two large sets of CSVs into two (folders of) parquet files and then (b) combine the two into a single working dataset. The two sets of CSVs all have the same column structure: `article_id` (string), `phrase` (string), `n` (integer). Without specifying a schema, in step (a) `open_dataset` parses `n` as `int64` in the first set and as `int32` in the second set. This results in an error when trying to combine them (because, I guess, automatic casting from int32 to int64 isn't supported yet). The maximum value of `n` across the entire first set is 576, so I'm not sure why arrow thinks it needs 64 bits in the first place.

Shorter background: In a collection of CSVs, `open_dataset` parses the column `n` as `int64` and I need it to be `int32`.

An example CSV, the first file (from the first of the two sets), is attached: [1960-1-01.csv.zip](https://github.com/apache/arrow/files/8102736/1960-1-01.csv.zip)

I've tried specifying a schema, as follows:

```
text_ar = open_dataset(noun_phrase_dir, format = 'csv',
                       schema = schema(article_id = string(),
                                       phrase = string(),
                                       n = int32()))
```

However, when a schema is specified, `open_dataset` treats the header row as data and raises an error because the letter n can't be cast to an integer:

```
Error: Invalid: Could not open CSV input source [path to file]: Invalid: In CSV column #2: Row #1: CSV conversion error to int32: invalid value 'n'
```

So then I tried adding the argument `skip = 1`:

```
text_ar = open_dataset(noun_phrase_dir, format = 'csv',
                       schema = schema(article_id = string(),
                                       phrase = string(),
                                       n = int32()),
                       skip = 1)
```

Which returns this error:

```
Error: The following option is supported in "read_delim_arrow" functions but not yet supported here: "skip"
```

My next approach was to let `open_dataset` just use the default schema, and cast `n` to `int32` as a second step.
I can't seem to figure out how to fit `call_function` into `dplyr` syntax; I've tried a few variations on the following:

```
text_ar |>
    dplyr::mutate(n = call_function("cast", n, int32()))
```

```
Error: Expression call_function("cast", n, int32()) not supported in Arrow
Call collect() first to pull data into R.
```

The `Table` class has a `$cast` method that might do the trick. But `text_ar` is a `FileSystemDataset` and either doesn't have a `$cast` method or I don't understand what it expects.

```
text_ar$cast(schema(n = int32()))
```

```
Error: attempt to apply non-function
```

It does have a `$schema` field, but per the docs this doesn't support casting:

```
text_ar$schema <- schema(n = int32())
```

```
Error: Type error: fields had matching names but differing types. From: n: int64 To: n: int32
```

I haven't been able to find any information on converting a `FileSystemDataset` to a `Table`. I have no idea where to go from here.
