Hi,
I hope you are well. I am migrating from disk.frame to arrow in RStudio. I am really impressed with how fast arrow is compared to disk.frame, but I need help resolving this error: "CSV conversion error to int64: invalid value"

To give an idea of my dataset: I have CSVs all placed in a local folder. I cannot provide a snapshot of the header and first few rows because of data privacy concerns. What I can say is that I am working with Census household and population datasets of a developing country.

The code below worked and took less than a second to open the dataset, even though the files are 5.47 GB in size:

```
arrow <- open_dataset(
  sources = "folder location of 117 csvs on household level data",
  format = "csv")
```

However, when I run any of the following:

```
arrow %>% count(column) %>% collect()
nrow(arrow %>% collect())
head(arrow %>% collect(), 10)
```

I always get the same error message: "Invalid: In CSV column #12: Row #580. CSV conversion error to int64: invalid value"

I tried going back to:

```
arrow <- open_dataset(
  same set of arguments as earlier,
  skip = 1,
  schema()
)
```

I experimented with setting the column that is giving me problems to utf8, large_utf8, or str in the schema argument:

```
schema(
  col = utf8(),
  other nth columns
)
```

But I still encounter the same problem. And every time I change the data type of one problematic column, another data type problem arises in another column. For comparison, in disk.frame only 3 of these variables are character type (in R), and the colClasses argument worked fine.

I also tried running the code below:

```
arrow <- open_dataset(
  sources = "csv location",
  format = "csv")
arrow
```

This prints ‘FileSystemDataset with 117 csv files’ along with the data type of each column. I then filled in the schema argument with the corresponding data type for each column, but I just circled back to the same data type problems.

Using the code below does not work either: `arrow2 <- arrow_table(arrow)`

Thanks in advance if you can help me.
--
Regards,
Angelo Casalan
Statistical Methodology Unit