In case it hasn't already been mentioned here, I wonder if manually setting the schema would help. You're correct that the invalid value isn't scientific notation (it's a blank string), so maybe that column should be read as a string column instead. You could take the guessed schema from the original `open_dataset()` call, change any problematic columns to string type, then open the dataset again with that schema and try to `collect()` (example below). My guess is that disk.frame and arrow infer column types differently, which is why you're seeing the difference.

```
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

temp <- tempfile()
writeLines("col1,col2\na,1\nb,2\n", temp)
cat(readLines(temp), sep = "\n")
#> col1,col2
#> a,1
#> b,2

(ds <- open_dataset(temp, format = "csv"))
#> FileSystemDataset with 1 csv file
#> col1: string
#> col2: int64

schema <- ds$schema
schema$col2 <- string()

(ds2 <- open_dataset(temp, format = "csv", schema = schema))
#> FileSystemDataset with 1 csv file
#> col1: string
#> col2: string
```

On Tue, Jan 31, 2023 at 6:43 PM Nic Crane <thisis...@gmail.com> wrote:

> Hi Angelo,
>
> The original code with just `open_dataset()` works as it's created a
> dataset without actually pulling the data into your R session. The
> subsequent commands you tried (i.e. involving `collect()` read in the
> files, resulting in an error when the data is read in.
>
> It looks like there's an invalid value in your dataset which is causing it
> to fail to load. From the error message you see there, it looks like it's
> in the 12th column of your data in row 580. I think when Jacob asked "have
> you checked the value there", another way of phrasing what he said would be
> to ask if you have manually checked the contents of whichever CSV is
> causing the problem, in row 580 and column 12, to see what value is there?
> (rather than checking the data type/value reported by Arrow).
>
> It's going to be tricky to help diagnose the issue without a reproducible
> example. If I'm working with a larger dataset, I usually narrow down the
> issue by dividing it into two smaller datasets and running the code on each
> to see which one contains the problematic row, and then keep going until I
> find the row which is failing to load. If you can get to the point where
> you can pinpoint the exact values which are causing problems, this will be
> the quickest way we can help you.
>
> Best wishes,
>
> Nic
>
> On Tue, 31 Jan 2023 at 00:52, Angelo Casalan <acasalan...@gmail.com>
> wrote:
>
> > Hi Jacob,
> >
> > Thanks. To provide some specifics on my query:
> >
> > 1.which version of arrow are you running?
> > - 10.0.1
> >
> > 2. The error message provides an exact col,row position, have you checked
> > the value there?
> > Yes. It is int64. This is after running open_dataset without specifying
> > schema:
> > '''
> > arrow<-open_dataset(
> > sources="location of csv files",
> > format="csv"
> > )
> > '''
> >
> > 3. I have to correct the exact error message:
> > CSV conversion error to int64:invalid value ' '
> > I think arrow tells me the invalid value present is ' '
> >
> > 4. This reminds me of cases where scientific notation is used for integers
> > which causes an error but that usually shows the value e.g. "1e6".
> > the invalid value is: ' '
> >
> > 5. I am really confused because using disk.frame() function, on the same
> > csvs, I have not encountered this problem on this column because it was
> > cleanly encoded as a numeric variable.
> >
> > Regards,
> >
> > On Fri, Jan 27, 2023 at 9:43 AM Angelo Casalan <acasalan...@gmail.com>
> > wrote:
> >
> > > Hi ,
> > >
> > > I hope you are well. I wish to ask how I can resolve this error:
> > >
> > > "CSV conversion error to int64: invalid value"
> > >
> > > To give an idea of my dataset. I have 4 csvs all placed in a local
> > > folder.
> > >
> > > The code below worked when importing:
> > >
> > > arrow<-open_dataset(
> > > sources="csv location",
> > > format="csv")
> > >
> > > However, when I run:
> > >
> > > arrow %>% count(column) %>% collect()
> > > nrow(arrow %>% collect)
> > >
> > > head(arrow %>% collect(),10 )
> > >
> > > I always get the same error message: "Invalid: In CSV column #12: Row
> > > #580. CSV conversion error to int64: invalid value"
> > >
> > > I tried going back to open_dataset(,schema() ).
> > > Where the column that is giving me problems is set as utf8 or
> > > sometimes str in the schema argument.
> > >
> > > schema(
> > > col=utf8(),
> > > other nth columns
> > > )
> > >
> > > But I still encounter the same problem.
> > >
> > > Using this code below fail to work either.
> > >
> > > arrow2<-arrow_table(arrow)
> > >
> > > Thanks in advance if you can help me.
> > >
> > > --
> > > Regards,
> > >
> > > Angelo Casalan
> > > Statistical Methodology Unit
> >
> > --
> > Regards,
> >
> > Angelo Casalan
> > Statistical Methodology Unit