Hi,


I hope you are well. I am migrating from disk.frame to arrow in RStudio. I
am really impressed with how fast arrow is compared to disk.frame, but I need
help resolving some errors.



I wish to ask how I can resolve this error:



"CSV conversion error to int64: invalid value"



To give an idea of my dataset: I have CSVs, all placed in a local folder.



I cannot provide a snapshot of the heading and first few rows because of
data privacy concerns. What I can say is I am working with Census household
and population datasets of a developing country.



The code below worked and took less than a second to run, even though the
data is 5.47 GB in size:



```
library(arrow)
library(dplyr)

arrow <- open_dataset(
  sources = "folder location of 117 csvs on household level data",
  format = "csv"
)
```



However, when I run:



```
arrow %>% count(column) %>% collect()
nrow(arrow %>% collect())
head(arrow %>% collect(), 10)
```



I always get the same error message: "Invalid: In CSV column #12: Row
#580. CSV conversion error to int64: invalid value"



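I tried going back to the open_dataset() call and adding the skip and schema
arguments:

```
arrow <- open_dataset(
  # same set of arguments as earlier
  sources = "folder location of 117 csvs on household level data",
  format = "csv",
  skip = 1,
  schema = schema(...)  # one entry per column, as described below
)
```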


I experimented with setting the column that is giving me problems to utf8,
large_utf8, or str in the schema argument:


```
schema(
  col = utf8()
  # ...followed by the other columns, each with its data type
)
```

But I still encounter the same problem. Every time I change the data
type of one problematic column, a data type problem arises in another
column.
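To make the schema-based approach concrete, here is a minimal, self-contained sketch on made-up toy data. The folder, the file `part1.csv`, and the column names `id` and `code` are all hypothetical and have nothing to do with the census files. It assumes a reasonably recent arrow release in which `open_dataset()` accepts `schema` together with `skip` for CSV sources. The column containing a non-numeric value is declared as `string()` so that no int64 conversion is attempted on it.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data: one CSV whose `code` column contains a value
# ("12A") that cannot be converted to int64.
tmp <- tempfile("csv_demo_")
dir.create(tmp)
writeLines(
  c("id,code", "1,10", "2,20", "3,12A"),
  file.path(tmp, "part1.csv")
)

# Declare every column explicitly; the troublesome one is read as string.
sch <- schema(
  id   = int64(),
  code = string()
)

# Because the schema supplies the column names, the header row is skipped
# so that it is not read as data.
ds <- open_dataset(tmp, format = "csv", schema = sch, skip = 1)

result <- ds %>% collect()
print(result)
```

If a sketch like this runs cleanly while the real data still fails, the remaining errors may point to further columns that need a string type, which can then be cleaned and cast after loading.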



For comparison, in disk.frame only 3 of these variables are character type
(in R), and the colClasses argument worked fine.


Also, I tried running the code below:



```
arrow <- open_dataset(
  sources = "csv location",
  format = "csv"
)

arrow
```





to see ‘FileSystemDataset with 117 csv files’



Then, once I see the data types, I fill in the schema argument with the
corresponding data type for each column.

This just brought me back to the same problems with the data types of the
columns.





The code below does not work either:



```
arrow2 <- arrow_table(arrow)
```



Thanks in advance if you can help me.


-- 
Regards,

Angelo Casalan
Statistical Methodology Unit
