zagto commented on code in PR #13330: URL: https://github.com/apache/arrow/pull/13330#discussion_r952061137
########## cpp/src/arrow/array/array_encoded.cc: ########## @@ -0,0 +1,77 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/array/array_encoded.h" +#include "arrow/array/util.h" +#include "arrow/util/logging.h" +#include "arrow/util/rle_util.h" + +namespace arrow { + +// ---------------------------------------------------------------------- +// RunLengthEncodedArray + +RunLengthEncodedArray::RunLengthEncodedArray(const std::shared_ptr<ArrayData>& data) { + ARROW_CHECK_EQ(data->type->id(), Type::RUN_LENGTH_ENCODED); + SetData(data); +} + +RunLengthEncodedArray::RunLengthEncodedArray(const std::shared_ptr<DataType>& type, + int64_t length, + const std::shared_ptr<Array>& run_ends_array, + const std::shared_ptr<Array>& values_array, + int64_t offset) { + ARROW_CHECK_EQ(type->id(), Type::RUN_LENGTH_ENCODED); + SetData(ArrayData::Make(type, length, {NULLPTR}, 0, offset)); + data_->child_data.push_back(std::move(run_ends_array->data())); + data_->child_data.push_back(std::move(values_array->data())); Review Comment: I don't think I have written that down anywhere, so I'll put it here: Yes having the values buffer(s) in the parent would makes supporting multiple data types more complicated: - Each type in arrow has a defined buffer layout, RLE would not. It would have different buffers depending on the type it encodes, i.e. a single values buffer for integer types, typeID+offset for dense union..., but also add a run ends buffer somewhere - we could not rely on things like Validate(), unary kernels, array builders, or anything else to work on the values array without modification - While in Arrow C++ the parent array has pointer to the exact DataType object, in the format itself an array has only a type ID, RLE in this case. The type of the values would have to be put into the buffers somewhere, making the layout even more complicated Having the run ends as a buffer in the parent is less bad, but also causes problems: - When a buffer is passed over the C bridge, we don't really know it's size (or at least how I understand it). It can be determined either from the length field, or a combination of that and other buffers easily for other types, but not for RLE. The RLE format is designed to allow O(log(n)) random access ,using binary search. For Binary search, we need an upper end, which we conveniently get from the length field of a child array (Yes the whole array needs to be sorted). - Also having a child array gives us another type field, which means the format could support RLE arrays with run ends that are not int32 in the future. In general I like the idea that a user can simply run any analysis on the run ends and values array on thier own, if they find it useful for some reason. They could always construct an array from the buffers but it just seems less messy with separate arrays. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
