[ 
https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521577#comment-17521577
 ] 

Egill Axfjord Fridgeirsson commented on ARROW-16157:
----------------------------------------------------

Hi [~thisisnic],

I updated to the dev version and unfortunately I still get the issue. 

Here is my arrow::arrow_info() output, in case it helps:

{code:java}
 > arrow::arrow_info()
Arrow package version: 7.0.0.20220412

Capabilities:
               
dataset    TRUE
engine    FALSE
parquet    TRUE
json       TRUE
s3        FALSE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip      FALSE
brotli    FALSE
zstd      FALSE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc  FALSE

To reinstall with more optional capabilities enabled, see
   https://arrow.apache.org/docs/r/articles/install.html

Memory:
                  
Allocator   system
Current   76.29 Mb
Max        76.3 Mb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                   
C++ Library Version  8.0.0-SNAPSHOT
C++ Compiler                    GNU
C++ Compiler Version         11.2.0
{code}


> [R] Inconsistent behavior for arrow datasets vs working in memory
> -----------------------------------------------------------------
>
>                 Key: ARROW-16157
>                 URL: https://issues.apache.org/jira/browse/ARROW-16157
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>         Environment: Ubuntu 21.10
> R 4.1.3.
> Arrow 7.0.0
>            Reporter: Egill Axfjord Fridgeirsson
>            Assignee: Nicola Crane
>            Priority: Major
>
> When I generate a sparse matrix using indices from an arrow dataset I get 
> inconsistent behavior: sometimes there are duplicated indices, resulting in a 
> matrix with values greater than one in some places. When I load the dataset 
> into memory first, everything works as expected and all the values are one.
> Repro:
> {code:java}
> library(Matrix)
> library(dplyr)
> library(arrow)
> sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
> dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
> arrow::write_dataset(dF, path='./data/feather', format='feather')
> arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
> # Run the code below a few times; at some point the output of
> # unique(newSparse@x) is more than just 1, indicating there are
> # duplicate indices for the sparse matrix (then it adds the values there)
> newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
>                                   j = arrowDataset %>% pull(j),
>                                   x = 1)
> unique(newSparse@x) # here is the bug, @x is the slot for values
> arrowInMemory <- arrowDataset %>% collect()
> # after loading in memory the output is never more than 1 no matter how 
> # often I run it
> newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
>                                   j = arrowInMemory %>% pull(j),
>                                   x = 1)
> unique(newSparse@x){code}


