zagto commented on code in PR #14223:
URL: https://github.com/apache/arrow/pull/14223#discussion_r1091076673
##########
dev/archery/archery/integration/datagen.py:
##########
@@ -193,6 +193,28 @@ def generate_range(self, size, lower, upper, name=None,
         return PrimitiveColumn(name, size, is_valid, values)
+# Integer field that fulfils the requirements for the run ends field of RLE.
+# The integers are positive and in a strictly increasing sequence
+class RunEndsField(IntegerField):
+    def __init__(self, name, bit_width, *, nullable=False,
+                 metadata=None):
+        super().__init__(name, is_signed=True, bit_width=bit_width,
+                         nullable=nullable, metadata=metadata, min_value=1)
+
+    def generate_range(self, size, lower, upper, name=None,
+                       include_extremes=False):
+        # values = np.random.randint(lower, upper, size=size, dtype=np.int64)
+        rng = np.random.default_rng()
+        values = rng.choice(2 ** (self.bit_width - 1) - 1, size=size,
+                            replace=False)
Review Comment:
```suggestion
        values = rng.choice(2 ** (self.bit_width - 1) - 1, size=size,
                            replace=False)
        values += 1
```
> A run must have a length of at least 1. This means the values in the
> run ends array all are positive and in strictly ascending order.
In the spec, we explicitly forbid zero-length runs.
I think what is missing here is increasing each array value by 1. The range
end argument passed to `rng.choice` already contains `- 1`. Since that argument
is exclusive (it is the first value that cannot be drawn), the largest value
we currently generate is `2 ** (self.bit_width - 1) - 2`, which seems one too
low. The smallest value that can be drawn is 0, which is not a valid run end.
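
To illustrate the off-by-one, here is a minimal standalone sketch. It is not the
archery code: `sample_run_ends` is a hypothetical helper, and the final sort is
my assumption about how the strictly increasing order would be obtained.

```python
import numpy as np


def sample_run_ends(bit_width, size, seed=None):
    """Hypothetical helper: draw `size` distinct, sorted run ends."""
    rng = np.random.default_rng(seed)
    # Largest value representable by a signed integer of this width.
    max_value = 2 ** (bit_width - 1) - 1
    # rng.choice(n, ...) samples from 0 .. n - 1 (n itself is excluded),
    # so without a shift we could draw 0 and could never draw max_value.
    values = rng.choice(max_value, size=size, replace=False)
    # Shift the range to 1 .. max_value, so every run has length >= 1.
    values += 1
    # Assumption: sorting distinct values yields strictly increasing ends.
    return np.sort(values)


# With bit_width=3 there are exactly three candidates (0, 1, 2 before the
# shift), so drawing all three shows both extremes after the shift.
print(sample_run_ends(3, 3))
```

With the shift in place both the smallest legal run end (1) and the largest
signed value become reachable.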