[ https://issues.apache.org/jira/browse/ARROW-15065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462464#comment-17462464 ]
Vibhatha Lakmal Abeykoon commented on ARROW-15065:
--------------------------------------------------
[~jorisvandenbossche] and [~westonpace]
I looked into the code for this change and have a few questions about the
functions that need to be exposed to Python.
As far as I understand, the following functions are not exposed to Python yet.
Please correct me if I am wrong.
```c++
/// Dictionary arrays will always be counted in their entirety
/// even if the array only references a portion of the dictionary.
Result<int64_t> ARROW_EXPORT ReferencedBufferSize(const ArrayData& array_data);
/// \brief Returns the sum of bytes from all buffer ranges referenced
/// \see ReferencedBufferSize(const ArrayData& array_data) for details
Result<int64_t> ARROW_EXPORT ReferencedBufferSize(const Array& array_data);
/// \brief Returns the sum of bytes from all buffer ranges referenced
/// \see ReferencedBufferSize(const ArrayData& array_data) for details
Result<int64_t> ARROW_EXPORT ReferencedBufferSize(const ChunkedArray& array_data);
/// \brief Returns the sum of bytes from all buffer ranges referenced
/// \see ReferencedBufferSize(const ArrayData& array_data) for details
Result<int64_t> ARROW_EXPORT ReferencedBufferSize(const RecordBatch& array_data);
/// \brief Returns the sum of bytes from all buffer ranges referenced
/// \see ReferencedBufferSize(const ArrayData& array_data) for details
Result<int64_t> ARROW_EXPORT ReferencedBufferSize(const Table& array_data);
```
Here the "arrow::util::ReferencedBufferSize" functions need to be included in
the Cython bindings. I am not sure where best to put them, since none of the
members of the header `arrow/util/byte_size.h` are included in Cython yet.
*What would be a good place to put these declarations?*
Secondly, each type accepted by these functions
* ArrayData
* Array
* ChunkedArray
* RecordBatch
* Table
needs a method called `get_buffer_size` or a property `buffer_size` in its
Python API. Since we only focus on the actual data buffers, the name matters
for users to understand clearly what is being measured.
Suggestions: `data_buffer_size`, `buffer_size`
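To make the naming question concrete, here is a toy sketch in pure Python. It is not the Arrow API: `ToyInt32Array`, `total_buffer_size` and `ITEM_SIZE` are hypothetical names used only for illustration, and `buffer_size` is just one of the names suggested above. The model ignores validity bitmaps and only tracks a single fixed-width data buffer. It shows how a size that takes the slice offset and length into account can differ from the size of the whole backing buffer.

```python
ITEM_SIZE = 4  # bytes per int32 value (assumption for this toy model)


class ToyInt32Array:
    """Hypothetical stand-in for a fixed-width array; not a pyarrow class."""

    def __init__(self, buffer, offset, length):
        self.buffer = buffer  # backing data buffer, shared between slices
        self.offset = offset  # index of the first referenced element
        self.length = length  # number of referenced elements

    def slice(self, offset, length):
        """Return a zero-copy view that shares the same backing buffer."""
        if offset < 0 or length < 0 or offset + length > self.length:
            raise ValueError("slice out of bounds")
        return ToyInt32Array(self.buffer, self.offset + offset, length)

    @property
    def total_buffer_size(self):
        """Size of the whole backing buffer, regardless of the slice."""
        return len(self.buffer)

    @property
    def buffer_size(self):
        """Size of only the byte range this slice actually references."""
        return self.length * ITEM_SIZE


full = ToyInt32Array(bytes(100 * ITEM_SIZE), offset=0, length=100)
part = full.slice(10, 20)

print(full.total_buffer_size, full.buffer_size)
print(part.total_buffer_size, part.buffer_size)
```

For the unsliced array both values agree, while for the slice only the offset-aware value shrinks. That difference is what the chosen name and its docstring would need to convey.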
> [Python][R] Expose ReferencedBufferSize to python/R
> ---------------------------------------------------
>
> Key: ARROW-15065
> URL: https://issues.apache.org/jira/browse/ARROW-15065
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python, R
> Reporter: Weston Pace
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
> Labels: good-first-issue
>
> This could be a method on arrays, chunked arrays, record batches, and tables.
> This method takes array offsets into account.
> We should probably add this alongside the existing nbytes field with clear
> commenting about the difference between the two of them. Both can be useful
> depending on the need.