[ 
https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10088:
------------------------------------
    Description: 
Issues with metadata$r:

* Handling integer64 (and subclasses) when relating to automatic downcasting
* The ".internal.selfref" attribute from data.table is an externalptr, which 
won't be valid to serialize and restore, so it needs to be dropped (and then 
presumably also the data.table class too)
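
A minimal sketch of the second point, assuming the cleanup is done on the R object before it is written. The helper name strip_data_table is hypothetical (it is not part of arrow or data.table), and the example uses a base R stand-in object rather than a real data.table, so the attribute value below is only a placeholder for the externalptr:

```r
# Hypothetical helper, not part of arrow or data.table: remove the
# ".internal.selfref" attribute and the "data.table" class so that only
# serializable metadata remains on the object.
strip_data_table <- function(df) {
  attr(df, ".internal.selfref") <- NULL
  class(df) <- setdiff(class(df), "data.table")
  df
}

# Stand-in built with base R only; a real data.table would carry an
# externalptr in ".internal.selfref" instead of this placeholder string.
df <- data.frame(x = 1:3)
attr(df, ".internal.selfref") <- "placeholder for the externalptr"
class(df) <- c("data.table", "data.frame")

clean <- strip_data_table(df)
class(clean)                       # "data.frame"
attr(clean, ".internal.selfref")   # NULL
```

Whether arrow should do this when writing or when restoring metadata$r is left open here; the sketch only shows what "dropping" the two pieces amounts to.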

-----
Original description:

I've got a proprietary dataset where one of the columns is an integer64 but all 
of the values would fit within 32 bits.  As I understand it, arrow/feather will 
downcast that column when the data is read back into R (not ideal IMO, but not 
an issue generally).  However, I'm having some trouble with a specific dataset. 

When I read in the data, the column is set to the class "integer64", however 
the column type (typeof) is 'integer' and not 'double', which is the underlying 
type used by bit64.  This mismatch causes R data.table to error out 
([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325])

I do not have any issue with integer64 columns which have values > 2^32, and 
suspiciously I am also unable to recreate the issue by manually creating a 
data.table with an int64 column with small values (e.g. 
data.table(col=as.integer64(c(1,2,3))) )

I did look through the arrow::r cpp source and couldn't find an obvious case 
where the underlying storage array would be an integer but also have the 
'integer64' class attr assigned...  A fix would either be to remove the 
integer64 class attr, or ensure that the underlying data store is a REALSXP 
instead of INTEGERSXP.
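
As a user-side stopgap, the first of those two options (removing the class attr) could be sketched as below. This is only an illustration under stated assumptions: drop_mismatched_integer64 is a hypothetical helper, not an arrow or bit64 function, and the mismatched column is simulated with base R instead of being read from a feather file.

```r
# Hypothetical helper, not part of arrow or bit64: if a vector claims to be
# integer64 but is stored as a plain integer (INTEGERSXP), drop the class
# attribute so that the class and the storage type agree again.
drop_mismatched_integer64 <- function(x) {
  if (inherits(x, "integer64") && typeof(x) == "integer") {
    x <- unclass(x)
  }
  x
}

# Simulate the broken column: integer storage with an integer64 class attr.
bad <- structure(1:3, class = c("integer64", "np.ulong"))
typeof(bad)    # "integer"

fixed <- drop_mismatched_integer64(bad)
class(fixed)   # "integer"
typeof(fixed)  # "integer"

# Vectors without the mismatch pass through unchanged.
ok <- c(1.5, 2.5)
identical(drop_mismatched_integer64(ok), ok)  # TRUE
```

For a whole table, the same helper could be applied column by column, e.g. df[] <- lapply(df, drop_mismatched_integer64). Note that this also discards any subclass such as "np.ulong", so it trades the metadata for a column that str() and rbindlist can handle.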

My company's network policies won't let me upload the sample dataset, so I'm 
hoping this triggers some immediate thoughts.  If not, I can try to figure out 
how to upload the dataset or otherwise provide details from it as requested.

 
{code:java}
> arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> test = arrow::read_feather('~/test.feather')
> class(test$testCol)
[1] "integer64" "np.ulong"
> typeof(test$testCol)
[1] "integer"

> str(test)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       1 obs. of  1 variable:
 $ testCol:Error in as.character.integer64(object) :
  REAL() can only be applied to a 'numeric', not a 'integer'


#In the larger original dataset, it handles most columns properly, only the 
'testCol' breaks things.  Note the difference:
> typeof(df$goodCol)
[1] "double"
> class(df$goodCol)
[1] "integer64" "np.ulong"

> typeof(df$testCol)
[1] "integer"
> class(df$testCol)
[1] "integer64" "np.ulong"

> str(df)
Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
$ goodCol        :integer64 1599777000000604025 ... 
$ testCol        :Error in as.character.integer64(object) :

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.13.0 bit64_4.0.5       bit_4.0.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           lattice_0.20-41      arrow_1.0.1
 [4] assertthat_0.2.1     rappdirs_0.3.1       grid_3.6.1
 [7] R6_2.4.1             jsonlite_1.7.1       magrittr_1.5
[10] rlang_0.4.7          Matrix_1.2-18        vctrs_0.3.4
[13] reticulate_1.14-9001 tools_3.6.1          glue_1.4.2
[16] purrr_0.3.4          compiler_3.6.1       tidyselect_1.1.0
{code}

  was:
Issues with metadata$r:

* Handling subclasses of integer64 when relating to automatic downcasting
* The ".internal.selfref" attribute from data.table is an externalptr, which 
won't be valid to serialize and restore, so it needs to be dropped (and then 
presumably also the data.table class too)



> [R] Issues in restoring R metadata for "integer64", "data.table" classes
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10088
>                 URL: https://issues.apache.org/jira/browse/ARROW-10088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.1
>            Reporter: Kyle Kavanagh
>            Priority: Major
>             Fix For: 2.0.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
