[
https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521577#comment-17521577
]
Egill Axfjord Fridgeirsson commented on ARROW-16157:
----------------------------------------------------
Hi [~thisisnic] ,
I updated to the dev version and unfortunately I still get the issue.
Here is my arrow::arrow_info() output, in case that helps:
{code:java}
> arrow::arrow_info()
Arrow package version: 7.0.0.20220412
Capabilities:
dataset TRUE
engine FALSE
parquet TRUE
json TRUE
s3 FALSE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip FALSE
brotli FALSE
zstd FALSE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc FALSE
To reinstall with more optional capabilities enabled, see
https://arrow.apache.org/docs/r/articles/install.html
Memory:
Allocator system
Current 76.29 Mb
Max 76.3 Mb
Runtime:
SIMD Level avx2
Detected SIMD Level avx2
Build:
C++ Library Version 8.0.0-SNAPSHOT
C++ Compiler GNU
C++ Compiler Version 11.2.0
{code}
> [R] Inconsistent behavior for arrow datasets vs working in memory
> -----------------------------------------------------------------
>
> Key: ARROW-16157
> URL: https://issues.apache.org/jira/browse/ARROW-16157
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 7.0.0
> Environment: Ubuntu 21.10
> R 4.1.3.
> Arrow 7.0.0
> Reporter: Egill Axfjord Fridgeirsson
> Assignee: Nicola Crane
> Priority: Major
>
> When I generate a sparse matrix using indices from an Arrow dataset, I get
> inconsistent behavior: sometimes there are duplicated indices, resulting in a
> matrix with values greater than one in some places. When I load the dataset
> into memory first, everything works as expected and all the values are one.
> Repro:
> {code:java}
> library(Matrix)
> library(dplyr)
> library(arrow)
> sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
> dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
> arrow::write_dataset(dF, path='./data/feather', format='feather')
> arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
> # Run the code below a few times; at some point the output of
> # unique(newSparse@x) is more than just 1, indicating there are
> # duplicate indices for the sparse matrix (the values there are then added).
> newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
> j = arrowDataset %>% pull(j),
> x = 1)
> unique(newSparse@x) # here is the bug, @x is the slot for values
> arrowInMemory <- arrowDataset %>% collect()
> # After loading into memory, the output is never more than 1, no matter how
> # often I run it.
> newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
> j = arrowInMemory %>% pull(j),
> x = 1)
> unique(newSparse@x){code}