[ 
https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202925#comment-17202925
 ] 

Kyle Kavanagh commented on ARROW-10088:
---------------------------------------

Those np class tags are likely coming from my use of reticulate.  I have not 
been able to try the nightly build, as arrow::install_arrow(nightly=T) is 
still giving me 1.0.1 for some reason.  However, I've implemented logic to 
strip those classes before writing the file, and this seems to have fixed the 
issue described above.  It has also revealed another issue, related to 
downcasting.  The opt-out feature request in the linked ticket above would be 
very helpful for removing this as a variable entirely.

When I have an int64 dataframe column containing many zeros and many epoch 
nanosecond timestamps (i.e. proper int64 values), the arrow R package 
incorrectly downcasts it to int32, while pyarrow (through pandas) reading the 
same file preserves the int64 datatype and displays the correct values:

R:
{code:java}
> df2 = arrow::read_feather('../0x80cdbe0.feather')
> str(df2)
Classes ‘data.table’ and 'data.frame':  2928489 obs. of  1 variable:
 $ col: int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> str(df2[col>0])
Classes ‘data.table’ and 'data.frame':  2126 obs. of  1 variable:
 $ col: int  911390201 911666786 1949972332 1950146353 609740195 162563531 1070775442 1377384707 1493499527 446960602 ...
 - attr(*, ".internal.selfref")=<externalptr> 
{code}
Python:
{code:java}
> g=pd.read_feather('../0x80cdbe0.feather')[['col']]
> g.dtypes
col int64
dtype: object
> g.head()
col 
0 0
1 0
2 0
3 0
4 0
5 0                 
> g[g.col > 0].head()
      col
1064     1580686242311484921
1068     1580686242311761506
1262     1580686257372412575
1266     1580686257474045875
3851     1580686342134314860
{code}
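
The two outputs above can be compared directly. The following is a minimal sketch in plain Python, not taken from Arrow's code: it assumes the R values are what remains when an int64 is narrowed to a signed int32 by keeping only its low 32 bits. The helper names `truncate_to_int32` and `fits_in_int32` are hypothetical and exist only for this illustration.

```python
# Values printed by pandas in the comment above (true int64 epoch nanoseconds).
PANDAS_VALUES = [
    1580686242311484921,
    1580686242311761506,
    1580686257372412575,
    1580686257474045875,
    1580686342134314860,
]

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def truncate_to_int32(value: int) -> int:
    """Keep only the low 32 bits and reinterpret them as a signed int32."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low >= 2**31 else low


def fits_in_int32(values) -> bool:
    """Return True only if every value survives a narrowing to int32."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values)


truncated = [truncate_to_int32(v) for v in PANDAS_VALUES]
for original, narrow in zip(PANDAS_VALUES, truncated):
    print(original, "->", narrow)

# A column of mostly zeros plus a few large values is still not safe to narrow.
print("safe to downcast:", fits_in_int32([0, 0, 0] + PANDAS_VALUES))
```

Under that assumption, the first, second and fifth pandas values map to 911390201, 911666786 and 1949972332, which are the first three values R prints for `col>0`. The third and fourth map to negative numbers, which the `col>0` filter would drop. This is consistent with a truncating downcast, but it is only an observation from the posted values, not a confirmed diagnosis.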

> [R] Integer64 incorrectly read into R data.table
> ------------------------------------------------
>
>                 Key: ARROW-10088
>                 URL: https://issues.apache.org/jira/browse/ARROW-10088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.1
>            Reporter: Kyle Kavanagh
>            Priority: Major
>
> I've got a proprietary dataset where one of the columns is an integer64 but 
> all of the values would fit within 32 bits.  As I understand it, 
> arrow/feather will downcast that column when the data is read back into R 
> (not ideal IMO, but not an issue generally).  However, I'm having some 
> trouble with a specific dataset. 
> When I read in the data, the column is set to the class "integer64"; 
> however, the column type (typeof) is 'integer' and not 'double', which is 
> the underlying type used by bit64.  This mismatch causes R data.table to 
> error out 
> (https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325).
> I do not have any issue with integer64 columns which have values > 2^32, and 
> suspiciously I am also unable to recreate the issue by manually creating a 
> data.table with an int64 column with small values (e.g. 
> data.table(col=as.integer64(c(1,2,3)))).
> I did look through the arrow R package's C++ source and couldn't find an 
> obvious case where the underlying storage array would be an integer but also 
> have the 'integer64' class attr assigned.  A fix would be either to remove 
> the integer64 class attr, or to ensure that the underlying data store is a 
> REALSXP instead of an INTSXP.
> My company's network policies won't let me upload the sample dataset; I'm 
> hoping this triggers some immediate thoughts.  If not, I can try to figure 
> out how to upload the dataset or otherwise provide details from it as 
> requested.
>  
> {code:java}
> > arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> > test = arrow::read_feather('~/test.feather')
> > class(test$testCol)
> [1] "integer64" "np.ulong"
> > typeof(test$testCol)
> [1] "integer"
> > str(test)
> Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       1 obs. of  1 variable:
>  $ testCol:Error in as.character.integer64(object) :
>   REAL() can only be applied to a 'numeric', not a 'integer'
> # In the larger original dataset, it handles most columns properly; only the
> # 'testCol' breaks things.  Note the difference:
> > typeof(df$goodCol)
> [1] "double"
> > class(df$goodCol)
> [1] "integer64" "np.ulong"
> > typeof(df$testCol)
> [1] "integer"
> > class(df$testCol)
> [1] "integer64" "np.ulong"
> > str(df)
> Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
> $ goodCol        :integer64 1599777000000604025 ... 
> $ testCol        :Error in as.character.integer64(object) :
> > sessionInfo()
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)
>
> Matrix products: default
> BLAS:   /usr/lib64/libblas.so.3.4.2
> LAPACK: /usr/lib64/liblapack.so.3.4.2
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] data.table_1.13.0 bit64_4.0.5       bit_4.0.4
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5           lattice_0.20-41      arrow_1.0.1
>  [4] assertthat_0.2.1     rappdirs_0.3.1       grid_3.6.1
>  [7] R6_2.4.1             jsonlite_1.7.1       magrittr_1.5
> [10] rlang_0.4.7          Matrix_1.2-18        vctrs_0.3.4
> [13] reticulate_1.14-9001 tools_3.6.1          glue_1.4.2
> [16] purrr_0.3.4          compiler_3.6.1       tidyselect_1.1.0
> {code}
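
The typeof/class mismatch described in the issue can be illustrated outside R. bit64 keeps each integer64 value in the 8 bytes of a double, which is why a column carrying the "integer64" class is expected to have typeof "double". The following is a rough Python sketch of that storage idea using the standard `struct` module. The function names are hypothetical, and the code is an analogy for the layout, not bit64's or Arrow's implementation.

```python
import struct


def int64_as_double_bits(value: int) -> float:
    """Reinterpret the 8 bytes of a signed int64 as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<q", value))[0]


def double_bits_as_int64(value: float) -> int:
    """Inverse of the above: recover the int64 from the double's bytes."""
    return struct.unpack("<q", struct.pack("<d", value))[0]


good = 1599777000000604025  # goodCol value from the str(df) output above

# The carrier double is meaningless as a number; only its bytes matter.
carrier = int64_as_double_bits(good)
restored = double_bits_as_int64(carrier)
print(carrier)
print(restored)

# An int32 slot is 4 bytes, so it cannot carry an 8-byte integer64 payload.
print(struct.calcsize("<i"), struct.calcsize("<q"), struct.calcsize("<d"))
```

The round trip only works because the int64 and the double are both 8 bytes wide. A 4-byte integer vector tagged with the integer64 class has no such payload to reinterpret, which matches the "REAL() can only be applied to a 'numeric'" error shown in the issue.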



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
