[
https://issues.apache.org/jira/browse/ARROW-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Wortmann updated ARROW-14359:
-----------------------------------
Description:
When wrapping a numpy array as an Arrow tensor, the underlying memory needs to
be wrapped using a NumpyBuffer. The size of that buffer is calculated as
follows:
{code:cpp}
size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
{code}
However, this is only correct for contiguous arrays - say, we do the following:
{code:python}
>>> import numpy,pyarrow
>>> arr = numpy.empty((10,10))
>>> pyarrow.Tensor.from_numpy(arr[:,:5])
<pyarrow.Tensor>
type: double
shape: (10, 5)
strides: (80, 8)
{code}
The underlying NumpyBuffer will have size 50*8 = 400 here; however, going by shape
and strides, the last row starts at offset 9*80 = 720.
This is normally pretty harmless, because the buffer size isn't really used for
anything. However, Tensor::CheckTensorStridesValidity will still object to this
("strides must not involve buffer over run"). This will happen when we try to
use Tensor::Make to create a new tensor based on the same underlying buffer.
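The mismatch can be reproduced from shape and strides alone. A small Python sketch (variable names are illustrative) that puts the naive size next to the extent the strides actually span:
{code:python}
import numpy as np

# Non-contiguous view: shape (10, 5), strides (80, 8)
view = np.empty((10, 10))[:, :5]

# Naive buffer size, as NumpyBuffer currently computes it
naive_size = view.size * view.itemsize  # 50 * 8 = 400

# Extent the view actually spans: offset of the last element
# (per shape and strides) plus one element size
last_offset = sum((dim - 1) * stride
                  for dim, stride in zip(view.shape, view.strides))
required = last_offset + view.itemsize  # 9*80 + 4*8 + 8 = 760

print(naive_size, required)  # 400 760
{code}
Any strides check that compares the last-element offset against the naive size will therefore reject this view.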
The "correct" implementation here would likely be to do something similar to
CheckTensorStridesValidity, but with numpy flavour (untested!):
{code:cpp}
// Index of the last element, per dimension (assumes non-negative strides)
std::vector<npy_intp> last_index(PyArray_NDIM(ndarray));
for (int i = 0; i < PyArray_NDIM(ndarray); ++i) {
  last_index[i] = PyArray_DIM(ndarray, i) - 1;
}
auto last_elem = reinterpret_cast<uint8_t*>(
    PyArray_GetPtr(ndarray, last_index.data()));
size_ = last_elem - data_ + PyArray_DESCR(ndarray)->elsize;
{code}
> NumpyBuffer computes size incorrectly for non-contiguous arrays
> ---------------------------------------------------------------
>
> Key: ARROW-14359
> URL: https://issues.apache.org/jira/browse/ARROW-14359
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Reporter: Peter Wortmann
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)