I just opened https://issues.apache.org/jira/browse/ARROW-12530. This probably is an easy PR (we should add a unit test for the NumPyBuffer bug as part of it), but will have implications for 3rd party libraries that implement subclasses of Buffer because we're changing the base class members.
On Sat, Apr 24, 2021 at 8:37 AM Niranda Perera <[email protected]> wrote: > > +1 for former. > > On Sat, Apr 24, 2021 at 9:30 AM Wes McKinney <[email protected]> wrote: > > > Yes, I think the former would be acceptable. I don't think that anyone > > should be putting Buffer::mutable_data() on the inner loop of a hot > > path. > > > > On Sat, Apr 24, 2021 at 8:09 AM Antoine Pitrou <[email protected]> wrote: > > > > > > > > > It depends what that entails exactly? Would we introduce a conditional > > > in release mode: > > > > > > uint8_t* mutable_data() { > > > return is_mutable() ? const_cast<uint8_t*>(data()) : nullptr; > > > } > > > > > > or would be always blindly return the data pointer, which seems > > > extremely dangerous to me: > > > > > > uint8_t* mutable_data() { > > > return const_cast<uint8_t*>(data()); > > > } > > > > > > I would be ok with the former. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > Le 24/04/2021 à 15:05, Wes McKinney a écrit : > > > > hi folks, > > > > > > > > Thoughts about this? Since we already assert that is_mutable_ is true > > > > in debug builds when accessing mutable_data_, using a const cast here > > > > seems relatively benign, and then we can drop 8 bytes from the Buffer > > > > struct > > > > > > > > On Wed, Apr 21, 2021 at 10:10 AM Wes McKinney <[email protected]> > > wrote: > > > >> > > > >> I'd be open to this simplification of the data structure, but I'll see > > > >> what others think > > > >> > > > >> On Wed, Apr 21, 2021 at 9:26 AM Niranda Perera < > > [email protected]> wrote: > > > >>> > > > >>> @Wes McKinney On a separate note, why would there be 2 pointers in > > the Buffer class, const uint8_t* data_ & uint8_t* mutable_data_? Can't we > > have 1 pointer and then use const_cast<> in mutable_data() method (or vise > > versa)? > > > >>> > > > >>> On Wed, Apr 21, 2021 at 10:09 AM Niranda Perera < > > [email protected]> wrote: > > > >>>> > > > >>>> Sorry, that's exactly what you've mentioned in the jira. :-) Please > > ignore! > > > >>>> > > > >>>> On Wed, Apr 21, 2021 at 10:07 AM Niranda Perera < > > [email protected]> wrote: > > > >>>>> > > > >>>>> @Wes, @Antoine, > > > >>>>> As @Weston pointed out, it seems like the issue is here. > > > >>>>> > > https://github.com/apache/arrow/blob/37c27d1eaf0fa61281ad103c08a0251bb6883ec4/cpp/src/arrow/python/numpy_convert.cc#L51 > > > >>>>> When the numpy buffer's is_mutable_ marked as true, ideally, > > *mutable_data_ should have also been set IMO. > > > >>>>> > > > >>>>> On Wed, Apr 21, 2021 at 10:00 AM Wes McKinney <[email protected]> > > wrote: > > > >>>>>> > > > >>>>>> Definitely a bug. I just opened > > > >>>>>> > > > >>>>>> https://issues.apache.org/jira/browse/ARROW-12495 > > > >>>>>> > > > >>>>>> If there's a chance to get this into 4.0.0 this would be a nice > > one > > > >>>>>> but I suspect the next RC is already under way (it need not block > > > >>>>>> since this bug has been present a long time) > > > >>>>>> > > > >>>>>> On Wed, Apr 21, 2021 at 3:31 AM Antoine Pitrou < > > [email protected]> wrote: > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> It sounds like a bug if is_mutable_ is true but mutable_data_ is > > nullptr. > > > >>>>>>> > > > >>>>>>> Regards > > > >>>>>>> > > > >>>>>>> Antoine. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> Le 21/04/2021 à 03:17, Weston Pace a écrit : > > > >>>>>>>> If it comes from pandas (and is eligible for zero-copy) then the > > > >>>>>>>> buffer implementation will be `NumPyBuffer`. Printing one in > > GDB > > > >>>>>>>> yields... > > > >>>>>>>> > > > >>>>>>>> ``` > > > >>>>>>>> $12 = {_vptr.Buffer = 0x7f0b66e147f8 <vtable for > > > >>>>>>>> arrow::py::NumPyBuffer+16>, is_mutable_ = true, is_cpu_ = true, > > data_ > > > >>>>>>>> = 0x55b71f901a70 "\001", mutable_data_ = 0x0, size_ = 16, > > capacity_ = > > > >>>>>>>> 16, > > > >>>>>>>> parent_ = {<std::__shared_ptr<arrow::Buffer, > > > >>>>>>>> (__gnu_cxx::_Lock_policy)2>> = > > > >>>>>>>> {<std::__shared_ptr_access<arrow::Buffer, > > (__gnu_cxx::_Lock_policy)2, > > > >>>>>>>> false, false>> = {<No data fields>}, _M_ptr = 0x0, > > > >>>>>>>> _M_refcount = {_M_pi = 0x0}}, <No data fields>}, > > > >>>>>>>> memory_manager_ = {<std::__shared_ptr<arrow::MemoryManager, > > > >>>>>>>> (__gnu_cxx::_Lock_policy)2>> = > > > >>>>>>>> {<std::__shared_ptr_access<arrow::MemoryManager, > > > >>>>>>>> (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, > > > >>>>>>>> _M_ptr = 0x55b71fdca4e0, _M_refcount = {_M_pi = > > > >>>>>>>> 0x55b71fb90640}}, <No data fields>}} > > > >>>>>>>> ``` > > > >>>>>>>> > > > >>>>>>>> Notice that `is_cpu_` and `is_mutable_` are both `true`. It's > > maybe a > > > >>>>>>>> bug that `is_mutable_` is true. Although maybe not as it > > appears to > > > >>>>>>>> be telling whether the underlying numpy buffer itself is > > mutable or > > > >>>>>>>> not... > > > >>>>>>>> > > > >>>>>>>> ``` > > > >>>>>>>> if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) { > > > >>>>>>>> is_mutable_ = true; > > > >>>>>>>> } > > > >>>>>>>> ``` > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Tue, Apr 20, 2021 at 2:15 PM Niranda Perera < > > [email protected]> wrote: > > > >>>>>>>>> > > > >>>>>>>>> Hi all, > > > >>>>>>>>> > > > >>>>>>>>> We have been using Arrow v2.0.0 and we encountered the > > following issue. > > > >>>>>>>>> > > > >>>>>>>>> I was reading a table with numeric data using pandas.read_csv > > and then > > > >>>>>>>>> converting it into pyarrow table. In our application (Cylon > > > >>>>>>>>> <https://github.com/cylondata/cylon>), we are accessing this > > pyarrow table > > > >>>>>>>>> from c++. We want to access the mutable data of the arrays in > > the pyarrow > > > >>>>>>>>> table. > > > >>>>>>>>> > > > >>>>>>>>> But the following returns a nullptr. > > > >>>>>>>>> T *mutable_data = array->data()->GetMutableValues<T>(1); // > > returns nullptr > > > >>>>>>>>> > > > >>>>>>>>> Interestingly, > > > >>>>>>>>> array->data()->buffers[1]->IsMutable(); // returns true > > > >>>>>>>>> array->data()->buffers[1]->IsCpu(); // returns true > > > >>>>>>>>> > > > >>>>>>>>> This only happens when I use pandas df to create a pyarrow > > table. It > > > >>>>>>>>> wouldn't happen when I use pyarrow.read_csv. So, I am guessing > > there's some > > > >>>>>>>>> issue in the buffer creation from pandas df. > > > >>>>>>>>> > > > >>>>>>>>> Is this an expected behavior? or has this been resolved in > > v2.0< releases? > > > >>>>>>>>> > > > >>>>>>>>> Best > > > >>>>>>>>> -- > > > >>>>>>>>> Niranda Perera > > > >>>>>>>>> https://niranda.dev/ > > > >>>>>>>>> @n1r44 <https://twitter.com/N1R44> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> -- > > > >>>>> Niranda Perera > > > >>>>> https://niranda.dev/ > > > >>>>> @n1r44 > > > >>>>> > > > >>>> > > > >>>> > > > >>>> -- > > > >>>> Niranda Perera > > > >>>> https://niranda.dev/ > > > >>>> @n1r44 > > > >>>> > > > >>> > > > >>> > > > >>> -- > > > >>> Niranda Perera > > > >>> https://niranda.dev/ > > > >>> @n1r44 > > > >>> > > > > > -- > Niranda Perera > https://niranda.dev/ > @n1r44 <https://twitter.com/N1R44>
