Peter Wortmann created ARROW-14359:
--------------------------------------

             Summary: NumpyBuffer computes size incorrectly for non-contiguous arrays
                 Key: ARROW-14359
                 URL: https://issues.apache.org/jira/browse/ARROW-14359
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 5.0.0
            Reporter: Peter Wortmann


When wrapping a numpy array as an Arrow tensor, the underlying memory needs to 
be wrapped using a NumpyBuffer. The size of that buffer is calculated as 
follows:

{code:java}
    size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
{code}
However, this is only correct for contiguous arrays. Say we do the following:

{code:java}
>>> import numpy,pyarrow
>>> arr = numpy.empty((10,10))
>>> pyarrow.Tensor.from_numpy(arr[:,:5])
<pyarrow.Tensor>
type: double
shape: (10, 5)
strides: (80, 8)
{code}
The underlying NumpyBuffer will have size 50*8 = 400 here. However, going by shape and strides, the last row starts at offset 9*80 = 720 and its last element ends at byte 760, well past the end of the 400-byte buffer.
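For illustration, the mismatch can be computed directly from the view's shape and strides with plain NumPy (this is just a sketch of the arithmetic, not Arrow code):

{code:python}
import numpy as np

arr = np.empty((10, 10))
view = arr[:, :5]

# Size as NumpyBuffer currently computes it: element count * item size
naive_size = view.size * view.itemsize  # 50 * 8 = 400

# Byte offset of the last element addressed by the strided view,
# plus one item size: the span the buffer actually has to cover
last_offset = sum((dim - 1) * stride
                  for dim, stride in zip(view.shape, view.strides))
required_size = last_offset + view.itemsize  # 720 + 32 + 8 = 760

print(naive_size, required_size)  # 400 760
{code}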


This is normally pretty harmless, because the buffer size isn't really used for 
anything. However, Tensor::CheckTensorStridesValidity will still object to this 
("strides must not involve buffer over run"). This will happen when we try to 
use Tensor::Make to create a new tensor based on the same underlying buffer.
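Roughly speaking, the check rejects any tensor whose last addressable byte lies beyond the reported buffer size. A small Python rendering of that condition (the real check is C++ in Tensor::CheckTensorStridesValidity, so this is only illustrative) shows why a 400-byte size is rejected while a 760-byte size would pass:

{code:python}
def strides_overrun_buffer(shape, strides, itemsize, buffer_size):
    # Last addressable byte = offset of the last element + its item size;
    # it must not exceed the buffer size.
    last_offset = sum((dim - 1) * stride for dim, stride in zip(shape, strides))
    return last_offset + itemsize > buffer_size

print(strides_overrun_buffer((10, 5), (80, 8), 8, 400))  # True  -> "buffer over run"
print(strides_overrun_buffer((10, 5), (80, 8), 8, 760))  # False -> accepted
{code}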

The "correct" implementation here would likely be to do something similar to 
CheckTensorStridesValidity, but with numpy flavour (untested!):

{code:java}
  // Find the byte offset of the last element addressed by the strided array,
  // then add one item size to obtain the span the buffer has to cover.
  const int ndim = PyArray_NDIM(ndarray);
  std::vector<npy_intp> last_index(ndim);
  for (int i = 0; i < ndim; ++i) {
    last_index[i] = PyArray_DIM(ndarray, i) - 1;
  }
  auto last_elem = reinterpret_cast<uint8_t*>(PyArray_GetPtr(ndarray, last_index.data()));
  size_ = last_elem - data_ + PyArray_DESCR(ndarray)->elsize;
{code}


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
