[ https://issues.apache.org/jira/browse/ARROW-11745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-11745:
-----------------------------------
Labels: pull-request-available (was: )
> [C++] Improve configurability of random data generation
> -------------------------------------------------------
>
> Key: ARROW-11745
> URL: https://issues.apache.org/jira/browse/ARROW-11745
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 3.0.0
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{arrow::random::RandomArrayGenerator}} is useful for stress testing and
> benchmarking. Arrays of primitives can be generated with little boilerplate,
> but it is cumbersome to specify the creation of nested arrays or record
> batches, which are necessary for testing n-column operations such as group_by.
> My ideal API for random generation takes only a FieldVector, a number of
> rows, and a seed as arguments. Other options (such as minimum, maximum,
> unique count, null probability, etc.) are specified using field metadata so
> that they can be provided uniformly or granularly as necessary for a given
> test case:
> {code:c++}
> auto random_batch = Generate({
>   field("i32", int32()), // i32 may take any value between INT_MIN and INT_MAX
>                          // and will be null with default probability 0.01
>   field("f32", float32(), false), // f32 will be entirely valid
>   field("probability", float64(), true, key_value_metadata({
>     // custom random generation properties:
>     {"min", "0.0"},
>     {"max", "1.0"},
>     {"null_probability", "0.0001"},
>   })),
>   field("list_i32", list(
>     field("item", int32(), true, key_value_metadata({
>       // custom random generation properties can also be specified for nested
>       // fields:
>       {"min", "0"},
>       {"max", "1"},
>     }))
>   )),
> }, num_rows, 0xdeadbeef);
> {code}
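To illustrate how string-valued metadata such as "min", "max", and "null_probability" could drive generation, here is a minimal sketch that does not use Arrow at all. Everything in it is an assumption made for illustration: Metadata (a plain std::map standing in for field metadata), GenOptions, Column, ParseOptions, and GenerateDoubles are hypothetical names, and the default range of [0, 1) for doubles is an arbitrary choice. Only the 0.01 default null probability and the rule that a non-nullable field is entirely valid come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <vector>

// Hypothetical stand-in for a field's key/value metadata.
using Metadata = std::map<std::string, std::string>;

// Hypothetical per-field generation options.
struct GenOptions {
  double min = 0.0;                // assumed default, not from the issue
  double max = 1.0;                // assumed default, not from the issue
  double null_probability = 0.01;  // default stated in the issue
};

// Hypothetical column: values plus a validity flag per row.
struct Column {
  std::vector<double> values;
  std::vector<bool> valid;
};

// Read options from metadata; keys that are absent keep their defaults.
GenOptions ParseOptions(const Metadata& md, bool nullable) {
  GenOptions opts;
  if (!nullable) opts.null_probability = 0.0;  // non-nullable: entirely valid
  auto it = md.find("min");
  if (it != md.end()) opts.min = std::stod(it->second);
  it = md.find("max");
  if (it != md.end()) opts.max = std::stod(it->second);
  it = md.find("null_probability");
  if (it != md.end()) opts.null_probability = std::stod(it->second);
  assert(opts.min <= opts.max);
  assert(opts.null_probability >= 0.0 && opts.null_probability <= 1.0);
  return opts;
}

// Generate num_rows doubles in [min, max), marking some rows as null.
// The same seed always yields the same column.
Column GenerateDoubles(const GenOptions& opts, int64_t num_rows, uint32_t seed) {
  std::mt19937 rng(seed);
  std::bernoulli_distribution is_null(opts.null_probability);
  std::uniform_real_distribution<double> value(opts.min, opts.max);
  Column out;
  out.values.reserve(static_cast<size_t>(num_rows));
  out.valid.reserve(static_cast<size_t>(num_rows));
  for (int64_t i = 0; i < num_rows; ++i) {
    const bool null_row = is_null(rng);
    out.valid.push_back(!null_row);
    // Null slots get a placeholder value of 0.0.
    out.values.push_back(null_row ? 0.0 : value(rng));
  }
  return out;
}
```

With this shape, a test would call ParseOptions once per field and pass the result, the row count, and the seed to the generator, so uniform defaults and per-field overrides go through the same code path.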
--
This message was sent by Atlassian Jira
(v8.3.4#803005)