Re: R arrow package question

Dewey Dunnington Wed, 01 Feb 2023 05:14:54 -0800

In case it hasn't already been mentioned here, I wonder if manually setting
`schema()` would help. You're correct that the invalid value isn't
scientific notation (i.e., it's a blank string) so maybe that column should
be a string column instead. You could get the guessed schema from the
original open_dataset(), modify it to change any problematic columns to
"string" type, then open the dataset again and try to collect (example
below). I am guessing that disk.frame and arrow have different methods they
use to guess schemas which is why you're seeing the difference.


```
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()`
#> for more information.

temp <- tempfile()
writeLines("col1,col2\na,1\nb,2\n", temp)
cat(readLines(temp), sep = "\n")
#> col1,col2
#> a,1
#> b,2

(ds <- open_dataset(temp, format = "csv"))
#> FileSystemDataset with 1 csv file
#> col1: string
#> col2: int64
schema <- ds$schema
schema$col2 <- string()

(ds2 <- open_dataset(temp, format = "csv", schema = schema))
#> FileSystemDataset with 1 csv file
#> col1: string
#> col2: string
```
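
Once the problem column can be read as a string, the offending values still have to be located. Below is a minimal base R sketch of one way to do that. It assumes a small stand-in file; the file contents and the column name `col2` are hypothetical and not taken from the original data. It reads every column as character and flags entries that are not plain integers.

```r
# Hypothetical stand-in for one of the real CSV files: col2 has a blank value.
temp <- tempfile(fileext = ".csv")
writeLines(c("col1,col2", "a,1", "b, ", "c,3"), temp)

# Read every column as character so that no type conversion can fail.
raw <- read.csv(temp, colClasses = "character")

# Flag values in col2 that are not plain (optionally negative) integers.
bad <- !grepl("^-?[0-9]+$", raw$col2)

which(bad)     # positions of the bad values, counting data rows only
raw$col2[bad]  # the offending values themselves
```

The same check could be run on each of the real CSV files in turn to see which one holds the blank value.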

On Tue, Jan 31, 2023 at 6:43 PM Nic Crane <thisis...@gmail.com> wrote:

> Hi Angelo,
>
> The original code with just `open_dataset()` works because it creates a
> dataset without actually pulling the data into your R session.  The
> subsequent commands you tried (i.e. those involving `collect()`) read in the
> files, resulting in an error when the data is read in.
>
> It looks like there's an invalid value in your dataset which is causing it
> to fail to load.  From the error message you see there, it looks like it's
> in the 12th column of your data in row 580.  I think when Jacob asked "have
> you checked the value there", another way of phrasing what he said would be
> to ask if you have manually checked the contents of whichever CSV is
> causing the problem, in row 580 and column 12, to see what value is there?
> (rather than checking the data type/value reported by Arrow).
>
> It's going to be tricky to help diagnose the issue without a reproducible
> example. If I'm working with a larger dataset, I usually narrow down the
> issue by dividing it into two smaller datasets and running the code on each
> to see which one contains the problematic row, and then keep going until I
> find the row which is failing to load.  If you can get to the point where
> you can pinpoint the exact values which are causing problems, this will be
> the quickest way we can help you.
>
> Best wishes,
>
> Nic
>
> On Tue, 31 Jan 2023 at 00:52, Angelo Casalan <acasalan...@gmail.com>
> wrote:
>
> > Hi Jacob,
> >
> > Thanks. To provide some specifics on my query:
> >
> > 1. Which version of arrow are you running?
> > - 10.0.1
> >
> > 2. The error message provides an exact col,row position, have you checked
> > the value there?
> > Yes. It is int64. This is after running open_dataset without specifying
> > schema:
> > '''
> > arrow<-open_dataset(
> > sources="location of csv files",
> > format="csv"
> > )
> > '''
> >
> >  3. I have to correct the exact error message:
> > CSV conversion error to int64:invalid value ' '
> > I think arrow tells me the invalid value present is ' '
> >
> > 4. This reminds me of cases where scientific notation is used for integers
> > which causes an error but that usually shows the value e.g. "1e6".
> > the invalid value is: ' '
> >
> > 5. I am really confused because using disk.frame() function, on the same
> > csvs, I have not encountered this problem on this column because it was
> > cleanly encoded as a numeric variable.
> >
> > Regards,
> >
> >
> >
> > On Fri, Jan 27, 2023 at 9:43 AM Angelo Casalan <acasalan...@gmail.com>
> > wrote:
> >
> > > Hi ,
> > >
> > > I hope you are well. I wish to ask how I can resolve this error:
> > >
> > > "CSV conversion error to int64: invalid value"
> > >
> > >
> > > To give an idea of my dataset: I have 4 CSVs, all placed in a local
> > > folder.
> > >
> > >
> > > The code below worked when importing:
> > >
> > >
> > > arrow<-open_dataset(
> > > sources="csv location",
> > > format="csv")
> > >
> > >
> > > However, when I run:
> > >
> > >
> > > arrow %>% count(column) %>% collect()
> > > nrow(arrow %>% collect)
> > >
> > > head(arrow %>% collect(),10 )
> > >
> > > I always get the same  error message: "Invalid: In CSV column #12: Row
> > > #580. CSV conversion error to int64: invalid value"
> > >
> > > I tried going back to open_dataset(, schema()), where the column that is
> > > giving me problems is set as utf8 or sometimes str in the schema
> > > argument.
> > >
> > > schema(
> > > col=utf8(),
> > > other nth columns
> > > )
> > >
> > > But I still encounter the same problem.
> > >
> > > The code below does not work either.
> > >
> > > arrow2<-arrow_table(arrow)
> > >
> > > Thanks in advance if you can help me.
> > >
> > > --
> > > Regards,
> > >
> > > Angelo Casalan
> > > Statistical Methodology Unit
> > >
> >
> >
> > --
> > Regards,
> >
> > Angelo Casalan
> > Statistical Methodology Unit
> >
>
