pitrou commented on code in PR #47034:
URL: https://github.com/apache/arrow/pull/47034#discussion_r2298271773

##########
dev/archery/archery/integration/datagen.py:
##########
@@ -1157,7 +1164,7 @@ def _get_children(self):
         ]
 
     def generate_column(self, size, name=None):
-        values = self.values_field.generate_column(size)
+        values = self.values_field.generate_column(int(size/2))
         run_ends = self.run_ends_field.generate_column(size)
         if name is None:
             name = self.name

Review Comment:
Ok, I'm now convinced that this is incorrect. I think we should instead have something like:

```python
    def generate_column(self, size, name=None):
        num_run_ends = size // 4
        values = self.values_field.generate_column(num_run_ends)
        run_ends = self.run_ends_field.generate_column(num_run_ends, size)
        if name is None:
            name = self.name
        return RunEndEncodedColumn(name, size, run_ends, values)
```

and in `RunEndsField`:

```python
    def generate_column(self, size, logical_length, name=None):
        assert logical_length < 2**(self.bit_width - 1)
        rng = np.random.default_rng()
        # Generate values that are strictly increasing with a min-value of 1.
        # We sort the values to ensure they are strictly increasing and set
        # replace to False to avoid duplicates, ensuring a valid run-ends array.
        values = rng.choice(logical_length - 1, size=size, replace=False)
        values += 1
        values = sorted(values)
        values = list(map(int if self.bit_width < 64 else str, values))
        # RunEnds cannot be null, as such self.nullable == False and this
        # will generate a validity map of all ones.
        is_valid = self._make_is_valid(size)
        if name is None:
            name = self.name
        return PrimitiveColumn(name, size, is_valid, values)
```
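
To make the intended invariant concrete, here is a minimal standalone sketch of the same sampling idea. It is only an illustration: it uses the standard library's `random.sample` in place of `rng.choice(..., replace=False)` so that it runs without NumPy, and `make_run_ends` and `is_strictly_increasing_from_one` are hypothetical names that do not exist in archery.

```python
import random


def make_run_ends(num_runs, logical_length, seed=None):
    """Return `num_runs` distinct, sorted run ends in [1, logical_length - 1].

    Hypothetical helper for illustration; not part of archery.
    """
    if not 0 <= num_runs <= max(logical_length - 1, 0):
        raise ValueError("not enough distinct values for the requested runs")
    rng = random.Random(seed)
    # Sampling without replacement rules out duplicates, so sorting the
    # sample yields a strictly increasing sequence.
    return sorted(rng.sample(range(1, logical_length), num_runs))


def is_strictly_increasing_from_one(run_ends):
    """Check the two properties asked for above: min value >= 1, no repeats."""
    if not run_ends:
        return True
    return run_ends[0] >= 1 and all(
        a < b for a, b in zip(run_ends, run_ends[1:])
    )


# Mirrors the proposed split: a quarter as many runs as logical elements.
size = 100
num_run_ends = size // 4
run_ends = make_run_ends(num_run_ends, size, seed=0)
```

The point of passing both counts is that the number of run ends (`num_run_ends`) and the range they are drawn from (`size`) are independent, which is what the two-argument `generate_column` above expresses.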