[GitHub] [arrow] zagto commented on a diff in pull request #13330: ARROW-16772: [C++] Implement encode and decode functions for Run-Length encoding

GitBox Mon, 22 Aug 2022 18:39:24 -0700


zagto commented on code in PR #13330:
URL: https://github.com/apache/arrow/pull/13330#discussion_r952061137



##########
cpp/src/arrow/array/array_encoded.cc:
##########
@@ -0,0 +1,77 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/array/array_encoded.h"
+#include "arrow/array/util.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/rle_util.h"
+
+namespace arrow {
+
+// ----------------------------------------------------------------------
+// RunLengthEncodedArray
+
+RunLengthEncodedArray::RunLengthEncodedArray(const std::shared_ptr<ArrayData>& 
data) {
+  ARROW_CHECK_EQ(data->type->id(), Type::RUN_LENGTH_ENCODED);
+  SetData(data);
+}
+
+RunLengthEncodedArray::RunLengthEncodedArray(const std::shared_ptr<DataType>& 
type,
+                                             int64_t length,
+                                             const std::shared_ptr<Array>& 
run_ends_array,
+                                             const std::shared_ptr<Array>& 
values_array,
+                                             int64_t offset) {
+  ARROW_CHECK_EQ(type->id(), Type::RUN_LENGTH_ENCODED);
+  SetData(ArrayData::Make(type, length, {NULLPTR}, 0, offset));
+  data_->child_data.push_back(std::move(run_ends_array->data()));
+  data_->child_data.push_back(std::move(values_array->data()));

Review Comment:
   I don't think I have written that down anywhere, so I'll put it here:
   
   Yes having the values buffer(s) in the parent would makes supporting 
multiple data types more complicated:
   - Each type in arrow has a defined buffer layout, RLE would not. It would 
have different buffers depending on the type it encodes, i.e. a single values 
buffer for integer types, typeID+offset for dense union..., but also add a run 
ends buffer somewhere
   - we could not rely on things like Validate(), unary kernels, array 
builders, or anything else to work on the values array without modification
   - While in Arrow C++ the parent array has pointer to the exact DataType 
object, in the format itself an array has only a type ID, RLE in this case. The 
type of the values would have to be put into the buffers somewhere, making the 
layout even more complicated
   
   Having the run ends as a buffer in the parent is less bad, but also causes 
problems:
   - When a buffer is passed over the C bridge, we don't really know it's size 
(or at least how I understand it). It can be determined either from the length 
field, or a combination of that and other buffers easily for other types, but 
not for RLE. The RLE format is designed to allow O(log(n)) random access ,using 
binary search. For Binary search, we need an upper end, which we conveniently 
get from the length field of a child array (Yes the whole array needs to be 
sorted).
   - Also having a child array gives us another type field, which means the 
format could support RLE arrays with  run ends that are not int32 in the future.
   
   In general I like the idea that a user can simply run any analysis on the 
run ends and values array on thier own, if they find it useful for some reason. 
They could always construct an array from the buffers but it just seems less 
messy with separate arrays.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] zagto commented on a diff in pull request #13330: ARROW-16772: [C++] Implement encode and decode functions for Run-Length encoding

Reply via email to