[
https://issues.apache.org/jira/browse/ARROW-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Wortmann updated ARROW-14359:
-----------------------------------
Description:
When wrapping a numpy array as an Arrow tensor, the underlying memory needs to
be wrapped using a NumpyBuffer. The size of that buffer is calculated as
follows:
{code:cpp}
size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
{code}
However, this is only correct for contiguous arrays - say, we do the following:
{code:python}
>>> import numpy,pyarrow
>>> arr = numpy.empty((10,10))
>>> pyarrow.Tensor.from_numpy(arr[:,:5])
<pyarrow.Tensor>
type: double
shape: (10, 5)
strides: (80, 8)
{code}
The underlying NumpyBuffer will have size 50*8 = 400 here; however, going by shape
and strides, the last row starts at offset 9*80 = 720.
This is normally pretty harmless, because the buffer size isn't really used for
anything. However, Tensor::CheckTensorStridesValidity will still object to this
("strides must not involve buffer over run"). This will happen when we try to
use Tensor::Make to create a new tensor based on the same underlying buffer.
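The mismatch can be reproduced from shape and strides alone. A small Python sketch (variable names are illustrative) that puts the naive size next to the extent the strides actually span:
{code:python}
import numpy as np

# Non-contiguous view: shape (10, 5), strides (80, 8)
view = np.empty((10, 10))[:, :5]

# Naive buffer size, as NumpyBuffer currently computes it
naive_size = view.size * view.itemsize  # 50 * 8 = 400

# Extent the view actually spans: offset of the last element
# (per shape and strides) plus one element size
last_offset = sum((dim - 1) * stride
                  for dim, stride in zip(view.shape, view.strides))
required = last_offset + view.itemsize  # 9*80 + 4*8 + 8 = 760

print(naive_size, required)  # 400 760
{code}
Any strides check that compares the last-element offset against the naive size will therefore reject this view.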
The "correct" implementation here would likely be to do something similar to
CheckTensorStridesValidity, but with numpy flavour (untested!):
{code:cpp}
// Index of the last element, per dimension (assumes non-negative strides)
std::vector<npy_intp> last_index(PyArray_NDIM(ndarray));
for (int i = 0; i < PyArray_NDIM(ndarray); ++i) {
  last_index[i] = PyArray_DIM(ndarray, i) - 1;
}
auto last_elem = reinterpret_cast<uint8_t*>(
    PyArray_GetPtr(ndarray, last_index.data()));
size_ = last_elem - data_ + PyArray_DESCR(ndarray)->elsize;
{code}
> NumpyBuffer computes size incorrectly for non-contiguous arrays
> ---------------------------------------------------------------
>
> Key: ARROW-14359
> URL: https://issues.apache.org/jira/browse/ARROW-14359
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Reporter: Peter Wortmann
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)