hi folks, Thoughts about this? Since we already assert that is_mutable_ is true in debug builds when accessing mutable_data_, using a const cast here seems relatively benign, and then we can drop 8 bytes from the Buffer struct
On Wed, Apr 21, 2021 at 10:10 AM Wes McKinney <[email protected]> wrote: > > I'd be open to this simplification of the data structure, but I'll see > what others think > > On Wed, Apr 21, 2021 at 9:26 AM Niranda Perera <[email protected]> > wrote: > > > > @Wes McKinney On a separate note, why would there be 2 pointers in the > > Buffer class, const uint8_t* data_ & uint8_t* mutable_data_? Can't we have > > 1 pointer and then use const_cast<> in mutable_data() method (or vise > > versa)? > > > > On Wed, Apr 21, 2021 at 10:09 AM Niranda Perera <[email protected]> > > wrote: > >> > >> Sorry, that's exactly what you've mentioned in the jira. :-) Please ignore! > >> > >> On Wed, Apr 21, 2021 at 10:07 AM Niranda Perera <[email protected]> > >> wrote: > >>> > >>> @Wes, @Antoine, > >>> As @Weston pointed out, it seems like the issue is here. > >>> https://github.com/apache/arrow/blob/37c27d1eaf0fa61281ad103c08a0251bb6883ec4/cpp/src/arrow/python/numpy_convert.cc#L51 > >>> When the numpy buffer's is_mutable_ marked as true, ideally, > >>> *mutable_data_ should have also been set IMO. > >>> > >>> On Wed, Apr 21, 2021 at 10:00 AM Wes McKinney <[email protected]> wrote: > >>>> > >>>> Definitely a bug. I just opened > >>>> > >>>> https://issues.apache.org/jira/browse/ARROW-12495 > >>>> > >>>> If there's a chance to get this into 4.0.0 this would be a nice one > >>>> but I suspect the next RC is already under way (it need not block > >>>> since this bug has been present a long time) > >>>> > >>>> On Wed, Apr 21, 2021 at 3:31 AM Antoine Pitrou <[email protected]> > >>>> wrote: > >>>> > > >>>> > > >>>> > It sounds like a bug if is_mutable_ is true but mutable_data_ is > >>>> > nullptr. > >>>> > > >>>> > Regards > >>>> > > >>>> > Antoine. > >>>> > > >>>> > > >>>> > Le 21/04/2021 à 03:17, Weston Pace a écrit : > >>>> > > If it comes from pandas (and is eligible for zero-copy) then the > >>>> > > buffer implementation will be `NumPyBuffer`. Printing one in GDB > >>>> > > yields... > >>>> > > > >>>> > > ``` > >>>> > > $12 = {_vptr.Buffer = 0x7f0b66e147f8 <vtable for > >>>> > > arrow::py::NumPyBuffer+16>, is_mutable_ = true, is_cpu_ = true, data_ > >>>> > > = 0x55b71f901a70 "\001", mutable_data_ = 0x0, size_ = 16, capacity_ = > >>>> > > 16, > >>>> > > parent_ = {<std::__shared_ptr<arrow::Buffer, > >>>> > > (__gnu_cxx::_Lock_policy)2>> = > >>>> > > {<std::__shared_ptr_access<arrow::Buffer, (__gnu_cxx::_Lock_policy)2, > >>>> > > false, false>> = {<No data fields>}, _M_ptr = 0x0, > >>>> > > _M_refcount = {_M_pi = 0x0}}, <No data fields>}, > >>>> > > memory_manager_ = {<std::__shared_ptr<arrow::MemoryManager, > >>>> > > (__gnu_cxx::_Lock_policy)2>> = > >>>> > > {<std::__shared_ptr_access<arrow::MemoryManager, > >>>> > > (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, > >>>> > > _M_ptr = 0x55b71fdca4e0, _M_refcount = {_M_pi = > >>>> > > 0x55b71fb90640}}, <No data fields>}} > >>>> > > ``` > >>>> > > > >>>> > > Notice that `is_cpu_` and `is_mutable_` are both `true`. It's maybe > >>>> > > a > >>>> > > bug that `is_mutable_` is true. Although maybe not as it appears to > >>>> > > be telling whether the underlying numpy buffer itself is mutable or > >>>> > > not... > >>>> > > > >>>> > > ``` > >>>> > > if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) { > >>>> > > is_mutable_ = true; > >>>> > > } > >>>> > > ``` > >>>> > > > >>>> > > > >>>> > > On Tue, Apr 20, 2021 at 2:15 PM Niranda Perera > >>>> > > <[email protected]> wrote: > >>>> > >> > >>>> > >> Hi all, > >>>> > >> > >>>> > >> We have been using Arrow v2.0.0 and we encountered the following > >>>> > >> issue. > >>>> > >> > >>>> > >> I was reading a table with numeric data using pandas.read_csv and > >>>> > >> then > >>>> > >> converting it into pyarrow table. In our application (Cylon > >>>> > >> <https://github.com/cylondata/cylon>), we are accessing this > >>>> > >> pyarrow table > >>>> > >> from c++. We want to access the mutable data of the arrays in the > >>>> > >> pyarrow > >>>> > >> table. > >>>> > >> > >>>> > >> But the following returns a nullptr. > >>>> > >> T *mutable_data = array->data()->GetMutableValues<T>(1); // returns > >>>> > >> nullptr > >>>> > >> > >>>> > >> Interestingly, > >>>> > >> array->data()->buffers[1]->IsMutable(); // returns true > >>>> > >> array->data()->buffers[1]->IsCpu(); // returns true > >>>> > >> > >>>> > >> This only happens when I use pandas df to create a pyarrow table. It > >>>> > >> wouldn't happen when I use pyarrow.read_csv. So, I am guessing > >>>> > >> there's some > >>>> > >> issue in the buffer creation from pandas df. > >>>> > >> > >>>> > >> Is this an expected behavior? or has this been resolved in v2.0< > >>>> > >> releases? > >>>> > >> > >>>> > >> Best > >>>> > >> -- > >>>> > >> Niranda Perera > >>>> > >> https://niranda.dev/ > >>>> > >> @n1r44 <https://twitter.com/N1R44> > >>> > >>> > >>> > >>> -- > >>> Niranda Perera > >>> https://niranda.dev/ > >>> @n1r44 > >>> > >> > >> > >> -- > >> Niranda Perera > >> https://niranda.dev/ > >> @n1r44 > >> > > > > > > -- > > Niranda Perera > > https://niranda.dev/ > > @n1r44 > >
