[ 
https://issues.apache.org/jira/browse/ARROW-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Wortmann updated ARROW-14359:
-----------------------------------
    Priority: Trivial  (was: Major)

> NumpyBuffer computes size incorrectly for non-contiguous arrays
> ---------------------------------------------------------------
>
>                 Key: ARROW-14359
>                 URL: https://issues.apache.org/jira/browse/ARROW-14359
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>            Reporter: Peter Wortmann
>            Priority: Trivial
>
> When wrapping a numpy array as an Arrow tensor, the underlying memory needs 
> to be wrapped using a NumpyBuffer. The size of that buffer is calculated as 
> follows:
> {code:cpp}
>     size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
> {code}
> However, this is only correct for contiguous arrays - say, we do the 
> following:
> {code:python}
> >>> import numpy,pyarrow
> >>> arr = numpy.empty((10,10))
> >>> pyarrow.Tensor.from_numpy(arr[:,:5])
> <pyarrow.Tensor>
> type: double
> shape: (10, 5)
> strides: (80, 8)
> {code}
> The underlying NumpyBuffer will have size 50*8=400 here, however going by 
> shape and strideĀ  the last row starts at offset 9*80 = 720.
> This is normally pretty harmless, because the buffer size isn't really used 
> for anything. However, Tensor::CheckTensorStridesValidity will still object 
> to this ("strides must not involve buffer over run"). This will happen when 
> we try to use Tensor::Make to create a new tensor based on the same 
> underlying buffer.
> The "correct" implementation here would likely be to do something similar to 
> CheckTensorStridesValidity, but with numpy flavour (untested!):
> {code:cpp}
>   std::vector<npy_intp> last_index(shape);
>   for (int i = 0; i < ndarray->nd; ++i) {
>     last_index[i] = ndarray->dimensions[i]-1;
>   }
>   auto last_elem = reinterpret_cast<uint8_t*>(PyArray_GetPtr(ao, 
> last_index.data()));
>   size_ = last_elem - data_ + PyArray_DESCR(ndarray)->elsize;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to