hi folks,

Thoughts about this? Since we already assert that is_mutable_ is true
in debug builds when accessing mutable_data_, using a const cast here
seems relatively benign, and then we can drop 8 bytes from the Buffer
struct

On Wed, Apr 21, 2021 at 10:10 AM Wes McKinney <[email protected]> wrote:
>
> I'd be open to this simplification of the data structure, but I'll see
> what others think
>
> On Wed, Apr 21, 2021 at 9:26 AM Niranda Perera <[email protected]> 
> wrote:
> >
> > @Wes McKinney On a separate note, why would there be 2 pointers in the 
> > Buffer class, const uint8_t* data_ &  uint8_t* mutable_data_? Can't we have 
> > 1 pointer and then use const_cast<> in mutable_data() method (or vise 
> > versa)?
> >
> > On Wed, Apr 21, 2021 at 10:09 AM Niranda Perera <[email protected]> 
> > wrote:
> >>
> >> Sorry, that's exactly what you've mentioned in the jira. :-) Please ignore!
> >>
> >> On Wed, Apr 21, 2021 at 10:07 AM Niranda Perera <[email protected]> 
> >> wrote:
> >>>
> >>> @Wes, @Antoine,
> >>> As @Weston pointed out, it seems like the issue is here.
> >>> https://github.com/apache/arrow/blob/37c27d1eaf0fa61281ad103c08a0251bb6883ec4/cpp/src/arrow/python/numpy_convert.cc#L51
> >>> When the numpy buffer's is_mutable_ marked as true, ideally, 
> >>> *mutable_data_ should have also been set IMO.
> >>>
> >>> On Wed, Apr 21, 2021 at 10:00 AM Wes McKinney <[email protected]> wrote:
> >>>>
> >>>> Definitely a bug. I just opened
> >>>>
> >>>> https://issues.apache.org/jira/browse/ARROW-12495
> >>>>
> >>>> If there's a chance to get this into 4.0.0 this would be a nice one
> >>>> but I suspect the next RC is already under way (it need not block
> >>>> since this bug has been present a long time)
> >>>>
> >>>> On Wed, Apr 21, 2021 at 3:31 AM Antoine Pitrou <[email protected]> 
> >>>> wrote:
> >>>> >
> >>>> >
> >>>> > It sounds like a bug if is_mutable_ is true but mutable_data_ is 
> >>>> > nullptr.
> >>>> >
> >>>> > Regards
> >>>> >
> >>>> > Antoine.
> >>>> >
> >>>> >
> >>>> > Le 21/04/2021 à 03:17, Weston Pace a écrit :
> >>>> > > If it comes from pandas (and is eligible for zero-copy) then the
> >>>> > > buffer implementation will be `NumPyBuffer`.  Printing one in GDB
> >>>> > > yields...
> >>>> > >
> >>>> > > ```
> >>>> > > $12 = {_vptr.Buffer = 0x7f0b66e147f8 <vtable for
> >>>> > > arrow::py::NumPyBuffer+16>, is_mutable_ = true, is_cpu_ = true, data_
> >>>> > > = 0x55b71f901a70 "\001", mutable_data_ = 0x0, size_ = 16, capacity_ =
> >>>> > > 16,
> >>>> > >    parent_ = {<std::__shared_ptr<arrow::Buffer,
> >>>> > > (__gnu_cxx::_Lock_policy)2>> =
> >>>> > > {<std::__shared_ptr_access<arrow::Buffer, (__gnu_cxx::_Lock_policy)2,
> >>>> > > false, false>> = {<No data fields>}, _M_ptr = 0x0,
> >>>> > >        _M_refcount = {_M_pi = 0x0}}, <No data fields>},
> >>>> > >    memory_manager_ = {<std::__shared_ptr<arrow::MemoryManager,
> >>>> > > (__gnu_cxx::_Lock_policy)2>> =
> >>>> > > {<std::__shared_ptr_access<arrow::MemoryManager,
> >>>> > > (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>},
> >>>> > >        _M_ptr = 0x55b71fdca4e0, _M_refcount = {_M_pi =
> >>>> > > 0x55b71fb90640}}, <No data fields>}}
> >>>> > > ```
> >>>> > >
> >>>> > > Notice that `is_cpu_` and `is_mutable_` are both `true`.  It's maybe 
> >>>> > > a
> >>>> > > bug that `is_mutable_` is true.  Although maybe not as it appears to
> >>>> > > be telling whether the underlying numpy buffer itself is mutable or
> >>>> > > not...
> >>>> > >
> >>>> > > ```
> >>>> > > if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) {
> >>>> > >      is_mutable_ = true;
> >>>> > > }
> >>>> > > ```
> >>>> > >
> >>>> > >
> >>>> > > On Tue, Apr 20, 2021 at 2:15 PM Niranda Perera 
> >>>> > > <[email protected]> wrote:
> >>>> > >>
> >>>> > >> Hi all,
> >>>> > >>
> >>>> > >> We have been using Arrow v2.0.0 and we encountered the following 
> >>>> > >> issue.
> >>>> > >>
> >>>> > >> I was reading a table with numeric data using pandas.read_csv and 
> >>>> > >> then
> >>>> > >> converting it into pyarrow table. In our application (Cylon
> >>>> > >> <https://github.com/cylondata/cylon>), we are accessing this 
> >>>> > >> pyarrow table
> >>>> > >> from c++. We want to access the mutable data of the arrays in the 
> >>>> > >> pyarrow
> >>>> > >> table.
> >>>> > >>
> >>>> > >> But the following returns a nullptr.
> >>>> > >> T *mutable_data = array->data()->GetMutableValues<T>(1); // returns 
> >>>> > >> nullptr
> >>>> > >>
> >>>> > >> Interestingly,
> >>>> > >> array->data()->buffers[1]->IsMutable(); // returns true
> >>>> > >> array->data()->buffers[1]->IsCpu(); // returns true
> >>>> > >>
> >>>> > >> This only happens when I use pandas df to create a pyarrow table. It
> >>>> > >> wouldn't happen when I use pyarrow.read_csv. So, I am guessing 
> >>>> > >> there's some
> >>>> > >> issue in the buffer creation from pandas df.
> >>>> > >>
> >>>> > >> Is this an expected behavior? or has this been resolved in v2.0< 
> >>>> > >> releases?
> >>>> > >>
> >>>> > >> Best
> >>>> > >> --
> >>>> > >> Niranda Perera
> >>>> > >> https://niranda.dev/
> >>>> > >> @n1r44 <https://twitter.com/N1R44>
> >>>
> >>>
> >>>
> >>> --
> >>> Niranda Perera
> >>> https://niranda.dev/
> >>> @n1r44
> >>>
> >>
> >>
> >> --
> >> Niranda Perera
> >> https://niranda.dev/
> >> @n1r44
> >>
> >
> >
> > --
> > Niranda Perera
> > https://niranda.dev/
> > @n1r44
> >

Reply via email to