zeroshade commented on code in PR #14223:
URL: https://github.com/apache/arrow/pull/14223#discussion_r1091032960
##########
dev/archery/archery/integration/datagen.py:
##########
@@ -193,6 +193,28 @@ def generate_range(self, size, lower, upper, name=None,
         return PrimitiveColumn(name, size, is_valid, values)
+# Integer field that fulfils the requirements for the run ends field of RLE.
+# The integers are positive and in a strictly increasing sequence
+class RunEndsField(IntegerField):
+    def __init__(self, name, bit_width, *, nullable=False,
+                 metadata=None):
+        super().__init__(name, is_signed=True, bit_width=bit_width,
+                         nullable=nullable, metadata=metadata, min_value=1)
+
+    def generate_range(self, size, lower, upper, name=None,
+                       include_extremes=False):
+        # values = np.random.randint(lower, upper, size=size, dtype=np.int64)
+        rng = np.random.default_rng()
+        values = rng.choice(2 ** (self.bit_width - 1) - 1, size=size,
+                            replace=False)
Review Comment:
It's perfectly legal for run-ends to start on a non-zero value, because they
represent the *ends* of runs, unlike offsets, where each entry has both a
start and an end. So if you have an REE array representing
`[3, 3, 3, 4, 4, 4]`, you have two run_ends: `[3, 6]`, since each run-end is
the index of the start of the next run.
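
As a rough illustration of that example, here is a minimal pure-Python sketch. The helper name `run_ends_of` is made up for this comment and is not part of Arrow or archery; it only shows how the run-ends fall out of the logical values.

```python
def run_ends_of(values):
    """Return (run_values, run_ends) for a plain Python list.

    Each run end is the exclusive end index of its run, i.e. the index
    at which the next run starts.
    """
    run_values, run_ends = [], []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            # Same value as the current run: push its end one further.
            run_ends[-1] = i + 1
        else:
            # A new run starts at index i and (so far) ends at i + 1.
            run_values.append(v)
            run_ends.append(i + 1)
    return run_values, run_ends


run_values, run_ends = run_ends_of([3, 3, 3, 4, 4, 4])
print(run_values, run_ends)  # [3, 4] [3, 6]
```

Note that the first run-end is 3, not 0, and the last run-end equals the logical length of the array.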
While it's technically not *invalid*, it honestly wouldn't make any sense for
run-ends to start on a 0 value.
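
To see why a leading 0 is pointless, consider this small decoding sketch. `decode_runs` is a hypothetical helper written for this comment, not an Arrow API, and it says nothing about what Arrow's own validation accepts; it only expands runs the way the run-end definition implies.

```python
def decode_runs(run_values, run_ends):
    """Expand run-end encoded data back into a plain list."""
    out = []
    start = 0  # the first run implicitly starts at index 0
    for value, end in zip(run_values, run_ends):
        # Each run covers the half-open index range [start, end).
        out.extend([value] * (end - start))
        start = end
    return out


print(decode_runs([3, 4], [3, 6]))        # [3, 3, 3, 4, 4, 4]
print(decode_runs([9, 3, 4], [0, 3, 6]))  # [3, 3, 3, 4, 4, 4]
```

In the second call the run with value 9 ends at index 0, so it covers zero elements and never shows up in the decoded output.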