pitrou commented on code in PR #47034:
URL: https://github.com/apache/arrow/pull/47034#discussion_r2298271773
##########
dev/archery/archery/integration/datagen.py:
##########
@@ -1157,7 +1164,7 @@ def _get_children(self):
         ]
     def generate_column(self, size, name=None):
-        values = self.values_field.generate_column(size)
+        values = self.values_field.generate_column(int(size/2))
         run_ends = self.run_ends_field.generate_column(size)
         if name is None:
             name = self.name
Review Comment:
Ok, I'm now convinced that this is incorrect.
I think we should instead have something like:
```python
    def generate_column(self, size, name=None):
        num_run_ends = size // 4
        values = self.values_field.generate_column(num_run_ends)
        run_ends = self.run_ends_field.generate_column(num_run_ends, size)
        if name is None:
            name = self.name
        return RunEndEncodedColumn(name, size, run_ends, values)
```
and in `RunEndsField`:
```python
    def generate_column(self, size, logical_length, name=None):
        assert logical_length < 2**(self.bit_width - 1)
        rng = np.random.default_rng()
        # Generate values that are strictly increasing with a min-value of 1.
        # We sort the values to ensure they are strictly increasing and set
        # replace to False to avoid duplicates, ensuring a valid run-ends
        # array.
        values = rng.choice(logical_length - 1, size=size, replace=False)
        values += 1
        values = sorted(values)
        values = list(map(int if self.bit_width < 64 else str, values))
        # RunEnds cannot be null, as such self.nullable == False and this
        # will generate a validity map of all ones.
        is_valid = self._make_is_valid(size)
        if name is None:
            name = self.name
        return PrimitiveColumn(name, size, is_valid, values)
```
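To make the invariants easy to check in isolation, here is a minimal standalone sketch using only the standard library. It is not the archery code: the helper name `make_run_ends` is hypothetical, and pinning the final run end to the logical length is an assumption on my part (so that the runs cover the whole column), which differs from the snippet above where every run end is drawn below `logical_length`.

```python
import random


def make_run_ends(num_runs, logical_length, seed=None):
    """Return `num_runs` strictly increasing run ends.

    Hypothetical helper for illustration only. Assumes the last run end
    should equal `logical_length` so the runs span the whole column.
    """
    if not 1 <= num_runs <= logical_length:
        raise ValueError("need 1 <= num_runs <= logical_length")
    rng = random.Random(seed)
    # Pick num_runs - 1 distinct interior cut points in
    # [1, logical_length - 1]; sampling without replacement rules out
    # duplicates.
    interior = rng.sample(range(1, logical_length), num_runs - 1)
    # Sorting makes the sequence strictly increasing; the final run end
    # is pinned to logical_length (assumption, see above).
    return sorted(interior) + [logical_length]


size = 100
run_ends = make_run_ends(size // 4, size, seed=0)
```

With `size = 100` this yields 25 run ends, matching the `size // 4` physical length used for both children in the suggestion above.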