Hi Angelo,

The original code with just `open_dataset()` works because it creates a dataset without actually pulling the data into your R session. The subsequent commands you tried (i.e. those involving `collect()`) do read in the files, so the error only appears at that point, when the data is read in.
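
To illustrate that difference, here is a minimal sketch. The folder, file names, and values are made up for the example (they are not your data), and I've wrapped `collect()` in `tryCatch()` so the script runs to the end whether or not the read succeeds:

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for your folder: two tiny CSVs written to a temp
# directory. The second file has a lone space in the numeric column.
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id,amount", "1,10", "2,20"), file.path(csv_dir, "a.csv"))
writeLines(c("id,amount", "3, ", "4,40"), file.path(csv_dir, "b.csv"))

# Lazy step: creates the dataset object and infers a schema,
# but does not pull the rows into the R session.
ds <- open_dataset(sources = csv_dir, format = "csv")
print(ds$schema)

# Eager step: collect() reads and converts the rows, so any conversion
# error would surface here rather than at open_dataset().
outcome <- tryCatch(
  {
    ds %>% collect()
    "collected without error"
  },
  error = function(e) conditionMessage(e)
)
message(outcome)
```

The type shown by `ds$schema` is what Arrow inferred when the dataset was opened, so it tells you what Arrow expects in a column, not what is actually stored in every row of every file.
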
It looks like there's an invalid value in your dataset which is causing it to fail to load. From the error message, it looks like it's in the 12th column of your data, in row 580. I think when Jacob asked "have you checked the value there", he was asking whether you have manually checked the contents of whichever CSV is causing the problem, at row 580 and column 12, to see what value is there (rather than checking the data type reported by Arrow).

It's going to be tricky to help diagnose the issue without a reproducible example. If I'm working with a larger dataset, I usually narrow down the issue by dividing it into two smaller datasets and running the code on each to see which one contains the problematic row, and then keep going until I find the row which is failing to load.

If you can get to the point where you can pinpoint the exact values which are causing problems, this will be the quickest way we can help you.

Best wishes,

Nic

On Tue, 31 Jan 2023 at 00:52, Angelo Casalan <[email protected]> wrote:

> Hi Jacob,
>
> Thanks. To provide some specifics on my query:
>
> 1.which version of arrow are you running?
> - 10.0.1
>
> 2. The error message provides an exact col,row position, have you checked
> the value there?
> Yes. It is int64. This is after running open_dataset without specifying
> schema:
> '''
> arrow<-open_dataset(
> sources="location of csv files",
> format="csv"
> )
> '''
>
> 3. I have to correct the exact error message:
> CSV conversion error to int64:invalid value ' '
> I think arrow tells me the invalid value present is ' '
>
> 4. This reminds me of cases where scientific notation is used for integers
> which causes an error but that usually shows the value e.g. "1e6".
> the invalid value is: ' '
>
> 5. I am really confused because using disk.frame() function, on the same
> csvs, I have not encountered this problem on this column because it was
> cleanly encoded as a numeric variable.
>
> Regards,
>
> On Fri, Jan 27, 2023 at 9:43 AM Angelo Casalan <[email protected]>
> wrote:
>
> > Hi ,
> >
> > I hope you are well. I wish to ask how I can resolve this error:
> >
> > "CSV conversion error to int64: invalid value"
> >
> > To give an idea of my dataset. I have 4 csvs all placed in a local folder.
> >
> > The code below worked when importing:
> >
> > arrow<-open_dataset(
> > sources="csv location",
> > format="csv")
> >
> > However, when I run:
> >
> > arrow %>% count(column) %>% collect()
> > nrow(arrow %>% collect)
> >
> > head(arrow %>% collect(),10 )
> >
> > I always get the same error message: "Invalid: In CSV column #12: Row
> > #580. CSV conversion error to int64: invalid value"
> >
> > I tried going back to open_dataset(,schema() ). Where the column that is
> > giving me problems is set as utf8 or sometimes str in the schema argument.
> >
> > schema(
> > col=utf8(),
> > other nth columns
> > )
> >
> > But I still encounter the same problem.
> >
> > Using this code below fail to work either.
> >
> > arrow2<-arrow_table(arrow)
> >
> > Thanks in advance if you can help me.
> >
> > --
> > Regards,
> >
> > Angelo Casalan
> > Statistical Methodology Unit
> >
>
> --
> Regards,
>
> Angelo Casalan
> Statistical Methodology Unit
>
