[ 
https://issues.apache.org/jira/browse/ARROW-16833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554638#comment-17554638
 ] 

Zsolt Kegyes-Brassai commented on ARROW-16833:
----------------------------------------------

Hi [~thisisnic], thank you for the quick answer.

I spotted a strange behavior in the results of your code: the NA value was 
dropped silently. In my view this is plainly wrong. Do you agree? Shall I 
register a new ticket?

I am afraid that {{convert_options}} won't be the right solution, because it is 
not feasible to list all the erroneous values in a large, real-world dataset.
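
For illustration, this is roughly what the {{convert_options}} route would look 
like (only a sketch, assuming {{convert_options}} is passed through 
{{open_dataset()}} to the CSV reader the same way as in {{read_csv_arrow()}}; 
every bad token still has to be known and listed up front):

{code:r}
library(arrow)

# Sketch only: treat every known bad token as a null value.
# The null_values list has to be enumerated by hand, which is exactly the
# part that does not scale to a large, messy dataset.
ds <- open_dataset(
  "/temp/test", format = "csv",
  convert_options = CsvConvertOptions$create(
    null_values = c("", "NA", "error"),
    strings_can_be_null = TRUE
  )
)
{code}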

I tried to work around it with a {{mutate()}} call, but that doesn't work either:

 
{code:java}
library(arrow)
df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
readr::write_csv(df_numbers, "/temp/test/numbers.csv")
open_dataset("/temp/test", format = "csv") |> 
  dplyr::mutate(number = as.integer(number)) |> 
  write_dataset(here::here("/temp/test_ds"), format = "parquet")
#> Error: Invalid: Failed to parse string: 'NA' as a scalar of type int32
{code}
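
One direction that might avoid the failing cast is to blank out the bad values 
while the column is still a string, and only convert the type afterwards. This 
is just a sketch (same paths as above; it assumes the arrow dataset bindings 
for {{grepl()}} and {{if_else()}} behave as written in 8.0.0, and I have not 
verified it at scale):

{code:r}
library(arrow)
library(dplyr)

# Possible workaround (sketch):
# 1. replace every value that does not look like an integer with NA while the
#    column is still a string, so no failing cast is ever attempted;
# 2. cast the cleaned string column to integer (nulls stay null);
# 3. stream the result straight into a Parquet dataset.
open_dataset("/temp/test", format = "csv") |>
  mutate(number = if_else(grepl("^-?[0-9]+$", number), number, NA_character_)) |>
  mutate(number = as.integer(number)) |>
  write_dataset("/temp/test_ds", format = "parquet")
{code}

Even if this works, it is per-column, hand-written boilerplate, which brings me 
to the broader point below.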
 


Let me also share my view and some ideas for improvement. The arrow package 
makes the very powerful promise that much larger data sizes can be processed on 
an ordinary computer/laptop by using datasets. Because real data is usually 
messy, some flexible tools/options would be desirable for data cleaning (= type 
conversion) and column selection/renaming inside an 
{{open_dataset() -> write_dataset()}} code chunk, along the lines of the sketch 
below.
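
To make this concrete, the kind of chunk I have in mind looks roughly like this 
(a sketch only; the paths and the renamed column are invented, and the cleaning 
step is the same guarded conversion as above):

{code:r}
library(arrow)
library(dplyr)

# Desired workflow: select, rename and retype columns while streaming from a
# CSV dataset to a Parquet dataset, without collecting the data into memory.
open_dataset("/temp/test", format = "csv") |>
  select(number) |>                         # column selection
  rename(value = number) |>                 # column renaming
  mutate(value = if_else(grepl("^-?[0-9]+$", value), value, NA_character_)) |>
  mutate(value = as.integer(value)) |>      # type conversion (data cleaning)
  write_dataset("/temp/test_ds", format = "parquet")
{code}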

> [R] how to enforce type conversion in open_dataset()
> ----------------------------------------------------
>
>                 Key: ARROW-16833
>                 URL: https://issues.apache.org/jira/browse/ARROW-16833
>             Project: Apache Arrow
>          Issue Type: Improvement
>    Affects Versions: 8.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> Here is a small example:
> {code:java}
> library(arrow)
> df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
> str(df_numbers)
> #> tibble [8 x 1] (S3: tbl_df/tbl/data.frame)
> #>  $ number: chr [1:8] "1" "2" "3" "error" ...
> write_parquet(df_numbers, "numbers.parquet")
> open_dataset("numbers.parquet") 
> #> FileSystemDataset with 1 Parquet file
> #> number: string
> open_dataset("numbers.parquet", schema(number = int8())) |> dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: 'error' as a scalar of type int8
> {code}
> The expected result is an integer input column, where the non-integer values 
> are converted to NAs.
> How can this type conversion be enforced using a schema definition in 
> {{open_dataset()}}?
> Rationale: I would like to include this in a code chunk that imports a CSV 
> dataset and saves it as a Parquet dataset (open_dataset -> write_dataset), 
> where the type conversion based on a preset schema is done at the same time, 
> and all of these steps without loading all the data into memory.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
