[
https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-10088:
------------------------------------
Fix Version/s: (was: 2.0.0)
> [R] Don't store "data.table" pointer in metadata
> ------------------------------------------------
>
> Key: ARROW-10088
> URL: https://issues.apache.org/jira/browse/ARROW-10088
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 1.0.1
> Reporter: Kyle Kavanagh
> Assignee: Romain Francois
> Priority: Major
>
> Issues with metadata$r:
> * The ".internal.selfref" attribute from data.table is an externalptr, which
> cannot be validly serialized and restored, so it needs to be dropped (and
> presumably the data.table class along with it)
> -----
> Original description:
> I've got a proprietary dataset where one of the columns is an integer64 but
> all of the values would fit within 32 bits. As I understand it, arrow/feather
> will downcast that column when the data is read back into R (not ideal IMO,
> but not an issue generally). However, I'm having some trouble with a
> specific dataset.
> When I read in the data, the column is assigned the class "integer64", but
> its storage type (typeof) is 'integer' rather than 'double', which is the
> underlying type used by bit64. This mismatch causes R data.table to error
> out
> (https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325).
> I do not have any issue with integer64 columns that have values > 2^32, and,
> suspiciously, I am also unable to recreate the issue by manually creating a
> data.table with an int64 column with small values, e.g.
> data.table(col=as.integer64(c(1,2,3))).
> I did look through the arrow::r C++ source and couldn't find an obvious case
> where the underlying storage array would be an integer but also have the
> 'integer64' class attr assigned. A fix would be either to remove the
> integer64 class attr or to ensure that the underlying data store is a REALSXP
> instead of an INTEGERSXP.
> My company's network policies won't let me upload the sample dataset, so I'm
> hoping this triggers some immediate thoughts. If not, I can try to figure
> out how to upload the dataset or otherwise provide details from it as
> requested.
>
> {code:java}
> > arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> > test = arrow::read_feather('~/test.feather')
> > class(test$testCol)
> [1] "integer64" "np.ulong"
> > typeof(test$testCol)
> [1] "integer"
> > str(test)
> Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 1 variable:
> $ testCol:Error in as.character.integer64(object) :
> REAL() can only be applied to a 'numeric', not a 'integer'
> # In the larger original dataset, it handles most columns properly; only the
> # 'testCol' breaks things. Note the difference:
> > typeof(df$goodCol)
> [1] "double"
> > class(df$goodCol)
> [1] "integer64" "np.ulong"
> > typeof(df$testCol)
> [1] "integer"
> > class(df$testCol)
> [1] "integer64" "np.ulong"
> > str(df)
> Classes ‘data.table’ and 'data.frame': 214781 obs. of 17 variables:
> $ goodCol :integer64 1599777000000604025 ...
> $ testCol :Error in as.character.integer64(object) :
> > sessionInfo()
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)
>
> Matrix products: default
> BLAS: /usr/lib64/libblas.so.3.4.2
> LAPACK: /usr/lib64/liblapack.so.3.4.2
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8  LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8  LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8  LC_NAME=C
> [9] LC_ADDRESS=C  LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats  graphics  grDevices  utils  datasets  methods  base
>
> other attached packages:
> [1] data.table_1.13.0  bit64_4.0.5  bit_4.0.4
>
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.5  lattice_0.20-41  arrow_1.0.1
> [4] assertthat_0.2.1  rappdirs_0.3.1  grid_3.6.1
> [7] R6_2.4.1  jsonlite_1.7.1  magrittr_1.5
> [10] rlang_0.4.7  Matrix_1.2-18  vctrs_0.3.4
> [13] reticulate_1.14-9001  tools_3.6.1  glue_1.4.2
> [16] purrr_0.3.4  compiler_3.6.1  tidyselect_1.1.0
> {code}
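
For illustration only, the two conditions described in the issue (an
"integer64" class attribute on integer storage, and a data.table self-reference
carried in attributes) could be detected and worked around in plain R roughly
as follows. This is a minimal sketch, not the fix adopted in arrow: the helper
names has_int64_mismatch, drop_int64_class and strip_data_table are
hypothetical, and the sketch assumes only base R (no bit64, data.table or arrow
functions are called).

```r
# Hypothetical helpers (not part of arrow, bit64, or data.table); base R only.

# TRUE when a vector carries the "integer64" class but is not stored as a
# double (REALSXP), which is the storage type bit64 expects.
has_int64_mismatch <- function(x) {
  inherits(x, "integer64") && typeof(x) != "double"
}

# One of the two repairs suggested in the report: drop the class attribute
# and keep the values as a plain integer vector.
drop_int64_class <- function(x) {
  if (has_int64_mismatch(x)) as.integer(unclass(x)) else x
}

# Drop the data.table self-reference attribute and the "data.table" class,
# leaving whatever other classes the object has (e.g. "data.frame").
strip_data_table <- function(df) {
  attr(df, ".internal.selfref") <- NULL
  class(df) <- setdiff(class(df), "data.table")
  df
}

# Mimic the broken column from the report: integer storage, integer64 class.
bad <- structure(1:3, class = c("integer64", "np.ulong"))
fixed <- drop_int64_class(bad)
```

Applying has_int64_mismatch to each column of a data frame read with
arrow::read_feather would be one way to locate columns like 'testCol' before
they reach str() or data.table. The alternative repair mentioned in the report
(converting the storage to a REALSXP that bit64 can interpret) is not shown
here.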
--
This message was sent by Atlassian Jira
(v8.3.4#803005)