Re: nullptr for mutable data in pyarrow table from pandas

Antoine Pitrou Sat, 24 Apr 2021 06:09:33 -0700

It depends what that entails exactly? Would we introduce a conditionalin release mode:


  uint8_t* mutable_data() {
    return is_mutable() ? const_cast<uint8_t*>(data()) : nullptr;
  }

or would be always blindly return the data pointer, which seemsextremely dangerous to me:


  uint8_t* mutable_data() {
    return const_cast<uint8_t*>(data());
  }

I would be ok with the former.

Regards

Antoine.


Le 24/04/2021 à 15:05, Wes McKinney a écrit :

hi folks,

Thoughts about this? Since we already assert that is_mutable_ is true
in debug builds when accessing mutable_data_, using a const cast here
seems relatively benign, and then we can drop 8 bytes from the Buffer
struct

On Wed, Apr 21, 2021 at 10:10 AM Wes McKinney <[email protected]> wrote:


I'd be open to this simplification of the data structure, but I'll see
what others think

On Wed, Apr 21, 2021 at 9:26 AM Niranda Perera <[email protected]> wrote:


@Wes McKinney On a separate note, why would there be 2 pointers in the Buffer class, 
const uint8_t* data_ &  uint8_t* mutable_data_? Can't we have 1 pointer and then use 
const_cast<> in mutable_data() method (or vise versa)?

On Wed, Apr 21, 2021 at 10:09 AM Niranda Perera <[email protected]> 
wrote:


Sorry, that's exactly what you've mentioned in the jira. :-) Please ignore!

On Wed, Apr 21, 2021 at 10:07 AM Niranda Perera <[email protected]> 
wrote:


@Wes, @Antoine,
As @Weston pointed out, it seems like the issue is here.
https://github.com/apache/arrow/blob/37c27d1eaf0fa61281ad103c08a0251bb6883ec4/cpp/src/arrow/python/numpy_convert.cc#L51
When the numpy buffer's is_mutable_ marked as true, ideally, *mutable_data_ 
should have also been set IMO.

On Wed, Apr 21, 2021 at 10:00 AM Wes McKinney <[email protected]> wrote:


Definitely a bug. I just opened

https://issues.apache.org/jira/browse/ARROW-12495

If there's a chance to get this into 4.0.0 this would be a nice one
but I suspect the next RC is already under way (it need not block
since this bug has been present a long time)

On Wed, Apr 21, 2021 at 3:31 AM Antoine Pitrou <[email protected]> wrote:



It sounds like a bug if is_mutable_ is true but mutable_data_ is nullptr.

Regards

Antoine.


Le 21/04/2021 à 03:17, Weston Pace a écrit :

If it comes from pandas (and is eligible for zero-copy) then the
buffer implementation will be `NumPyBuffer`.  Printing one in GDB
yields...

```
$12 = {_vptr.Buffer = 0x7f0b66e147f8 <vtable for
arrow::py::NumPyBuffer+16>, is_mutable_ = true, is_cpu_ = true, data_
= 0x55b71f901a70 "\001", mutable_data_ = 0x0, size_ = 16, capacity_ =
16,
    parent_ = {<std::__shared_ptr<arrow::Buffer,
(__gnu_cxx::_Lock_policy)2>> =
{<std::__shared_ptr_access<arrow::Buffer, (__gnu_cxx::_Lock_policy)2,
false, false>> = {<No data fields>}, _M_ptr = 0x0,
        _M_refcount = {_M_pi = 0x0}}, <No data fields>},
    memory_manager_ = {<std::__shared_ptr<arrow::MemoryManager,
(__gnu_cxx::_Lock_policy)2>> =
{<std::__shared_ptr_access<arrow::MemoryManager,
(__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>},
        _M_ptr = 0x55b71fdca4e0, _M_refcount = {_M_pi =
0x55b71fb90640}}, <No data fields>}}
```

Notice that `is_cpu_` and `is_mutable_` are both `true`.  It's maybe a
bug that `is_mutable_` is true.  Although maybe not as it appears to
be telling whether the underlying numpy buffer itself is mutable or
not...

```
if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) {
      is_mutable_ = true;
}
```


On Tue, Apr 20, 2021 at 2:15 PM Niranda Perera <[email protected]> wrote:


Hi all,

We have been using Arrow v2.0.0 and we encountered the following issue.

I was reading a table with numeric data using pandas.read_csv and then
converting it into pyarrow table. In our application (Cylon
<https://github.com/cylondata/cylon>), we are accessing this pyarrow table
from c++. We want to access the mutable data of the arrays in the pyarrow
table.

But the following returns a nullptr.
T *mutable_data = array->data()->GetMutableValues<T>(1); // returns nullptr

Interestingly,
array->data()->buffers[1]->IsMutable(); // returns true
array->data()->buffers[1]->IsCpu(); // returns true

This only happens when I use pandas df to create a pyarrow table. It
wouldn't happen when I use pyarrow.read_csv. So, I am guessing there's some
issue in the buffer creation from pandas df.

Is this an expected behavior? or has this been resolved in v2.0< releases?

Best
--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>




--
Niranda Perera
https://niranda.dev/
@n1r44



--
Niranda Perera
https://niranda.dev/
@n1r44



--
Niranda Perera
https://niranda.dev/
@n1r44

Re: nullptr for mutable data in pyarrow table from pandas

Reply via email to