jorisvandenbossche commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r786528062



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,51 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()
+        c_res_buffer = ReferencedBufferSize(deref(c_array))
+        size = GetResultValue(c_res_buffer)
         return size
 
+    def get_total_buffer_size(self):
+        """
+        The sum of bytes in each buffer referenced by the array.
+
+        An array may only reference a portion of a buffer.
+        This method will overestimate in this case and return the
+        byte size of the entire buffer.
+
+        If a buffer is referenced multiple times then it will
+        only be counted once.
+        """
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            int64_t total_buffer_size
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()
+        total_buffer_size = TotalBufferSize(c_array[0])

Review comment:
   ```suggestion
           total_buffer_size = TotalBufferSize(deref(c_array))
   ```
   
   (for consistency with the `deref(c_array)` call in `nbytes` above?)

##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,51 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()

Review comment:
       I _think_ you can simplify this to `c_res_buffer = ReferencedBufferSize(deref(self.ap))`.

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2469,16 +2469,30 @@ def test_buffers_nested():
     assert struct.unpack('4xh', values) == (43,)
 
 
-def test_nbytes_sizeof():
+def test_total_buffer_size():
     a = pa.array(np.array([4, 5, 6], dtype='int64'))
-    assert a.nbytes == 8 * 3
+    assert a.get_total_buffer_size() == 8 * 3

Review comment:
       Can you keep both `assert a.nbytes == ..` and `assert a.get_total_buffer_size() == ..` here (and do the same for the asserts below)?
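
For context on why both asserts are worth keeping: the two values agree for an unsliced array but diverge once offsets or shared buffers are involved. Below is a rough pure-Python sketch of the two accounting rules as the docstrings describe them, assuming a single fixed-width data buffer per array. `ToyBuffer`, `ToyArray`, `referenced_size` and `total_size` are hypothetical names made up for illustration; they are not pyarrow APIs, and this is not how the C++ `ReferencedBufferSize` / `TotalBufferSize` functions are implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for a memory buffer; only its byte size matters."""
    size: int


@dataclass
class ToyArray:
    """Hypothetical fixed-width array: `length` items starting at `offset`."""
    data: ToyBuffer
    item_width: int
    offset: int
    length: int

    def slice(self, offset: int, length: int) -> "ToyArray":
        # Zero-copy slice: same buffer, shifted offset, new length.
        return ToyArray(self.data, self.item_width, self.offset + offset, length)


def referenced_size(arrays: List[ToyArray]) -> int:
    # nbytes-style rule: count only the byte range each array points at.
    # A range shared by several arrays is counted once per array.
    return sum(a.length * a.item_width for a in arrays)


def total_size(arrays: List[ToyArray]) -> int:
    # get_total_buffer_size-style rule: count each distinct buffer once,
    # in full, regardless of offsets.
    unique = {id(a.data): a.data for a in arrays}
    return sum(buf.size for buf in unique.values())


buf = ToyBuffer(size=3 * 8)          # room for three 8-byte values
full = ToyArray(buf, item_width=8, offset=0, length=3)
tail = full.slice(1, 2)              # last two values, same buffer

print(referenced_size([full]), total_size([full]))              # 24 24
print(referenced_size([tail]), total_size([tail]))              # 16 24
print(referenced_size([full, tail]), total_size([full, tail]))  # 40 24
```

Under these assumptions, the unsliced case is the only one where the two numbers match, so asserting both there pins down the baseline before the sliced cases pull them apart.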

##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,51 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()

Review comment:
       To get the C++ pointer, you can use the `self.sp_array` or `self.ap` attributes directly instead of calling `pyarrow_unwrap_array` on `self` (that function is only needed for, e.g., variables passed into a method).

##########
File path: python/pyarrow/table.pxi
##########
@@ -145,12 +145,50 @@ cdef class ChunkedArray(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the chunked array.
+
+        In other words, the sum of bytes from all buffer ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will only be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for chunk in self.iterchunks():
-            size += chunk.nbytes
+        cdef:
+            shared_ptr[CChunkedArray] shd_ptr_c_array
+            CChunkedArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_chunked_array(self)
+        c_array = shd_ptr_c_array.get()
+        c_res_buffer = ReferencedBufferSize(deref(c_array))

Review comment:
       Similar comment here: this can be simplified (no `pyarrow_unwrap_chunked_array(self)` needed). The exact attribute name to use may differ for each class.




