Re: R arrow package question

Jacob Wujciak Thu, 26 Jan 2023 23:49:25 -0800

Hello Angelo,

Just for completeness' sake, which version of arrow are you running?
The error message gives an exact column and row position. Have you checked
the value there? That could help pinpoint the cause of the error.
"Invalid: In CSV column #12: Row #580. CSV conversion error to int64:
invalid value"
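
As a rough sketch of how one might look for the offending value: the error does not say which of the 117 files it came from, so the base R helper below reads every CSV in a folder as character and lists the entries in one column that do not look like plain integers. The function name `find_bad_values` and the folder path are hypothetical. The row numbers it reports count data rows within each file, so they may be offset from the row number Arrow prints.

```r
# Hypothetical helper (not part of arrow): report values in one column that
# do not look like plain integers, across all CSV files in a folder.
find_bad_values <- function(folder, col_index) {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  hits <- lapply(files, function(f) {
    # Read every column as character so no value is converted on the way in.
    df <- read.csv(f, colClasses = "character", check.names = FALSE)
    x <- df[[col_index]]
    # Flag anything that is not missing, not empty, and not an optional sign
    # followed by digits (for example "1e6", "12.0" or "n/a").
    is_int <- grepl("^\\s*[-+]?[0-9]+\\s*$", x)
    bad <- which(!is_int & !is.na(x) & x != "")
    if (length(bad) == 0) return(NULL)
    data.frame(file = basename(f), row = bad, value = x[bad],
               stringsAsFactors = FALSE)
  })
  # Returns NULL when nothing was flagged in any file.
  do.call(rbind, hits)
}

# Example call (the path is a placeholder):
# find_bad_values("path/to/csv/folder", col_index = 12)
```

Running `packageVersion("arrow")` in the same session reports the installed arrow version.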


This reminds me of cases where scientific notation is used for integers,
which causes an error, but that usually shows the value, e.g. "1e6". In those
cases, the affected columns can be defined as double and then cast to int as
a workaround.
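
A minimal sketch of that workaround, assuming the arrow and dplyr packages are installed. The temporary file and the column names `id` and `population` are made up for illustration and have nothing to do with the real data. It reads the column as float64, so "1e6" parses, and then casts it to int64 inside the query.

```r
library(arrow)
library(dplyr)

# Tiny made-up CSV in which one integer is written in scientific notation.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,population", "1,1200", "2,1e6"), tmp)

# Declare only the problematic column as double; other types are inferred.
tbl <- read_csv_arrow(
  tmp,
  col_types = schema(population = float64()),
  as_data_frame = FALSE
)

# Cast back to a 64-bit integer before pulling the data into R.
result <- tbl %>%
  mutate(population = cast(population, int64())) %>%
  collect()

print(result)
```

For the multi-file dataset, the same idea would mean declaring the affected columns as float64() in the schema passed to open_dataset() (together with skip = 1, as in the original message) and casting afterwards. I have not tried this against the real files.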

Best
Jacob

On Fri, Jan 27, 2023 at 8:17 AM Angelo Casalan <acasalan...@gmail.com>
wrote:

> Hi,
>
>
>
> I hope you are well. I am migrating to arrow from disk.frame in RStudio. I
> am really impressed with how fast arrow is compared to disk.frame, but I
> need help to solve some errors.
>
>
>
>  I wish to ask how I can resolve this error:
>
>
>
> "CSV conversion error to int64: invalid value"
>
>
>
> To give an idea of my dataset: I have CSVs all placed in a local folder.
>
>
>
> I cannot provide a snapshot of the heading and first few rows because of
> data privacy concerns. What I can say is I am working with Census household
> and population datasets of a developing country.
>
>
>
> The code below worked and took less than a second when importing, despite
> being 5.47 GB in size:
>
>
>
> arrow<-open_dataset(
> sources="folder location of  117 csvs on household level data",
> format="csv")
>
>
>
> However, when I run:
>
>
>
> ```
>
> arrow %>% count(column) %>% collect()
> nrow(arrow %>% collect)
>
> head(arrow %>% collect(),10 )
>
> ```
>
>
>
> I always get the same  error message: "Invalid: In CSV column #12: Row
> #580. CSV conversion error to int64: invalid value"
>
>
>
> I tried going back to
>
>
> ```
>
> arrow<-open_dataset(
>
> same set of arguments as earlier,
>
> skip=1,
>
> ,schema()
>
> ).
>
>
> ```
>
>
>  I experimented where the column that is giving me problems is set as utf8
> or large_utf8  or   str in the schema argument.
>
>
> ```
>
> schema(
>
> col=utf8(),
>
> other nth columns
>
> )
>
> ```
>
> But I still encounter the same problem. And every time I change the data
> type of one problematic column, a data type problem arises in another
> column.
>
>
>
> To compare, in disk.frame, only 3 of these variables are character type (in
> R). colClasses argument worked fine.
>
>
> Also, I tried running the code below:
>
>
>
> ```
>
> arrow<-open_dataset(
> sources="csv location",
> format="csv")
>
>
>
> arrow
>
>
>
> ```
>
>
>
>
>
> to see ‘FileSystemDataset with 117 csv files’
>
>
>
> Then, once I see the data types, I fill in the schema argument with the
> corresponding data type per column.
>
> I just circled back to the same problems regarding the data types of the
> columns.
>
>
>
>
>
> Using the code below does not work either:
>
>
>
> arrow2<-arrow_table(arrow)
>
>
>
> Thanks in advance if you can help me.
>
>
> --
> Regards,
>
> Angelo Casalan
> Statistical Methodology Unit
>
