SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1965371836
> For instance, when sorting a chunked array, we don't want to keep copying
that same set of shared_ptrs over and over.
Sorry if I wasn't clear - the `ChunkedArrayResolver` returns raw pointers to
`Array`s, not shared pointers, so it avoids the atomic reference counting
overhead. Do you feel like it would be acceptable to do it this way in
`ChunkResolver`? I was thinking we could do something like this (pseudo-code):
```cpp
/// BEGIN: Copied from `chunked_internal.h`
template <typename ArrayType>
struct ResolvedChunk {
using ViewType = GetViewType<typename ArrayType::TypeClass>;
using LogicalValueType = typename ViewType::T;
const ArrayType* array;
const int64_t index;
ResolvedChunk(const ArrayType* array, int64_t index) : array(array),
index(index) {}
bool IsNull() const { return array->IsNull(index); }
LogicalValueType Value() const { return
ViewType::LogicalValue(array->GetView(index)); }
};
template <>
struct ResolvedChunk<Array> {
const Array* array;
const int64_t index;
ResolvedChunk(const Array* array, int64_t index) : array(array),
index(index) {}
bool IsNull() const { return array->IsNull(index); }
};
/// END: Copied from `chunked_internal.h`
struct ChunkLocation {
int64_t chunk_index;
int64_t index_in_chunk;
};
template <typename ArrayType = Array>
class ChunkResolver {
public:
inline ResolvedChunk<ArrayType> Resolve(int64_t index) const;
inline ChunkLocation ResolveLocation(int64_t index) const;
inline ChunkLocation ResolveLocationWithHint(ChunkLocation
hint) const;
private:
std::vector<int64_t> offsets_;
mutable std::atomic<int64_t> cached_chunk_;
// This contains `shared_ptr`s to simplify usage for some applications (to
avoid
// having to hold onto the original Arrays), but we would only return raw
pointers
// to avoid the reference counting overhead of copying `shared_ptr`s.
std::vector<std::shared_ptr<Array>> chunks_;
};
```
I agree that maintaining the performance of `ChunkResolver` is absolutely
critical, and I think this solution avoids adding any unnecessary overhead for
people who choose to use `ResolveLocation` and `ResolveLocationWithHint`.
> That and the fact that you would have to type the functions based on all
the types that are used to represent an array of array chunks.
I'm sorry, I'm not quite sure I understood this part. Could you clarify?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]