dhicks opened a new issue #12469:
URL: https://github.com/apache/arrow/issues/12469


   Background: I'm trying to use arrow to (a) assemble two large sets of CSVs 
into two (folders of) parquet files and then (b) combine the two into a single 
working dataset.  The two sets of CSVs all have the same column structure:  
`article_id` (string), `phrase` (string), `n` (integer).  Without specifying a 
schema, in step (a) `open_dataset` parses `n` as `int64` in the first set and 
as `int32` in the second set.  This results in an error when trying to combine 
them (because, I guess, automatic casting from int32 to int64 isn't supported 
yet).  The maximum value of `n` across the entire first set is 576, so I'm not 
sure why arrow thinks it needs 64 bits in the first place.  
   
   Shorter background:  In a collection of CSVs, `open_dataset` parses the 
column `n` as `int64` and I need it to be `int32`.  
   
   An example CSV, the first file (from the first of the two sets), is 
attached: 
[1960-1-01.csv.zip](https://github.com/apache/arrow/files/8102736/1960-1-01.csv.zip)
  
   
   I've tried specifying a schema, as follows:  
   ```
   text_ar = open_dataset(noun_phrase_dir, 
                          format = 'csv',
                          schema = schema(article_id = string(), 
                                          phrase = string(), 
                                          n = int32()))
   ```
   However, when a schema is specified, `open_dataset` treats the header row 
as data and raises an error because the letter n can't be cast to an integer: 
   ```
   Error: Invalid: Could not open CSV input source [path to file]: Invalid: In 
CSV column #2: Row #1: CSV conversion error to int32: invalid value 'n'
   ```
   So then I tried adding the argument `skip = 1`:
   ```
   text_ar = open_dataset(noun_phrase_dir, 
                          format = 'csv',
                          schema = schema(article_id = string(), 
                                          phrase = string(), 
                                          n = int32()), 
                          skip = 1)
   ```
   Which returns this error:
   ```
   Error: The following option is supported in "read_delim_arrow" functions but 
not yet supported here: "skip"
   ```
   
   My next approach was to let `open_dataset` use the default schema and 
cast `n` to `int32` as a second step.  I can't figure out how to fit 
`call_function` into `dplyr` syntax; I've tried a few variations on the following: 
   ```
   text_ar |> 
       dplyr::mutate(n = call_function("cast", n, int32()))
   ```
   ```
   Error: Expression call_function("cast", n, int32()) not supported in Arrow
   Call collect() first to pull data into R.
   ```
   
   The `Table` class has a `$cast` method that might do the trick.  But 
`text_ar` is a `FileSystemDataset` and either doesn't have a `$cast` method or 
I don't understand what it expects.  
   ```
   text_ar$cast(schema(n = int32()))
   ```
   ```
   Error: attempt to apply non-function
   ```
   It does have a `$schema` field, but per the docs assigning to it doesn't 
support casting: 
   ```
   text_ar$schema <- schema(n = int32())
   ```
   ```
   Error: Type error: fields had matching names but differing types. From: n: 
int64 To: n: int32
   ```
   
   I haven't been able to find any information on converting a 
`FileSystemDataset` to a `Table`.  I have no idea where to go from here. 
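
For the Dataset-to-Table conversion, one route that appears to work is going through a `Scanner`. Once the data is a `Table`, the `$cast` method takes a full target schema that names every column, not only `n`. This is a sketch on a hypothetical small folder `csv_dir`. Note that `ToTable()` materializes the whole dataset in memory, so it may not be suitable for the full set of files.

```r
library(arrow)

# Hypothetical stand-in for the CSV folder.
csv_dir <- tempfile("noun_phrases_")
dir.create(csv_dir)
write.csv(
  data.frame(article_id = c("a1", "a2"),
             phrase = c("foo bar", "baz"),
             n = c(3L, 576L)),
  file.path(csv_dir, "1960-1-01.csv"),
  row.names = FALSE
)

text_ar <- open_dataset(csv_dir, format = "csv")

# Scan the dataset into an in-memory Table.
tbl <- Scanner$create(text_ar)$ToTable()

# Table$cast expects a schema covering all columns of the table.
tbl32 <- tbl$cast(schema(article_id = string(),
                         phrase = string(),
                         n = int32()))
print(tbl32$schema)
```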

