Peter Wortmann created ARROW-14359:
--------------------------------------
Summary: NumpyBuffer computes size incorrectly for non-contiguous arrays
Key: ARROW-14359
URL: https://issues.apache.org/jira/browse/ARROW-14359
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 5.0.0
Reporter: Peter Wortmann
When wrapping a numpy array as an Arrow tensor, the underlying memory needs to
be wrapped using a NumpyBuffer. The size of that buffer is calculated as
follows:
{code:java}
size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
{code}
However, this is only correct for contiguous arrays. Say we do the following:
{code:java}
>>> import numpy,pyarrow
>>> arr = numpy.empty((10,10))
>>> pyarrow.Tensor.from_numpy(arr[:,:5])
<pyarrow.Tensor>
type: double
shape: (10, 5)
strides: (80, 8)
{code}
The underlying NumpyBuffer will have size 50*8 = 400 here. However, going by shape
and strides, the last row already starts at offset 9*80 = 720.
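The discrepancy can be reproduced from the numpy metadata of the sliced view alone; the following is just a sanity check of the arithmetic, not Arrow code:
{code:python}
import numpy

arr = numpy.empty((10, 10))
view = arr[:, :5]

# What NumpyBuffer currently reports: element count * itemsize
reported = view.size * view.itemsize                               # 50 * 8 = 400

# Bytes needed to cover the last element reachable via shape/strides
last_offset = sum((d - 1) * s for d, s in zip(view.shape, view.strides))
required = last_offset + view.itemsize                             # 752 + 8 = 760

print(reported, required)                                          # 400 760
{code}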
This is normally fairly harmless, because the buffer size isn't really used for
anything. However, Tensor::CheckTensorStridesValidity will still object
("strides must not involve buffer over run") when we try to use Tensor::Make to
create a new tensor based on the same underlying buffer.
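Roughly, that check boils down to comparing the last reachable byte against the buffer size; a simplified Python sketch (non-negative strides only, not the actual C++ logic) shows why the mis-sized buffer gets rejected:
{code:python}
def strides_overrun(buffer_size, shape, strides, itemsize):
    # Last byte touched when walking to the final element via shape/strides
    last_byte = sum((d - 1) * s for d, s in zip(shape, strides)) + itemsize
    return last_byte > buffer_size

strides_overrun(400, (10, 5), (80, 8), 8)   # True  -> "strides must not involve buffer over run"
strides_overrun(760, (10, 5), (80, 8), 8)   # False -> accepted once the size accounts for strides
{code}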
The "correct" implementation here would likely be to do something similar to
CheckTensorStridesValidity, but with numpy flavour (untested!):
{code:java}
// Index of the last element reachable via shape/strides.
const int ndim = PyArray_NDIM(ndarray);
std::vector<npy_intp> last_index(ndim);
for (int i = 0; i < ndim; ++i) {
  last_index[i] = PyArray_DIM(ndarray, i) - 1;
}
// The buffer has to extend to the end of that element (this assumes
// non-negative strides and a non-empty array).
auto last_elem = reinterpret_cast<uint8_t*>(PyArray_GetPtr(ndarray, last_index.data()));
size_ = (last_elem - data_) + PyArray_DESCR(ndarray)->elsize;
{code}