wjones127 commented on code in PR #15101:
URL: https://github.com/apache/arrow/pull/15101#discussion_r1060973607
##########
cpp/src/parquet/arrow/reader_writer_benchmark.cc:
##########
@@ -197,6 +197,52 @@ BENCHMARK_TEMPLATE2(BM_WriteColumn, true, DoubleType);
BENCHMARK_TEMPLATE2(BM_WriteColumn, false, BooleanType);
BENCHMARK_TEMPLATE2(BM_WriteColumn, true, BooleanType);
+int32_t kInfiniteUniqueValues = -1;
+
+std::shared_ptr<::arrow::Table> RandomStringTable(int64_t length, int64_t unique_values,
+ int64_t null_percentage) {
+ std::shared_ptr<::arrow::DataType> type = ::arrow::utf8();
+ std::shared_ptr<::arrow::Array> arr;
+ ::arrow::random::RandomArrayGenerator generator(500);
+ double null_probability = static_cast<double>(null_percentage) / 100.0;
+ if (unique_values == kInfiniteUniqueValues) {
+ arr = generator.String(length, /*min_length=*/3, /*max_length=*/32,
+ /*null_probability=*/null_probability);
+ } else {
+ arr = generator.StringWithRepeats(length, /*unique=*/unique_values,
+ /*min_length=*/3, /*max_length=*/32,
+ /*null_probability=*/null_probability);
+ }
+ return ::arrow::Table::Make(
+ ::arrow::schema({::arrow::field("column", type, null_percentage > 0)}), {arr});
+}
+
+static void BM_WriteBinaryColumn(::benchmark::State& state) {
Review Comment:
I added a comment near the parameters of each benchmark explaining that we
use `unique_values` to trigger the dictionary and plain encoding code paths.
I also tried adding a check inside the benchmark to validate that we get the
expected encodings, but it turned out to be too complicated: the encoding can
change from page to page, and the reported encodings also cover the
definition and repetition levels (IIUC).
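
To illustrate why a single "expected encoding" is hard to assert on, here is a
toy model of a writer that dictionary-encodes pages until the dictionary grows
past a byte limit and then falls back to plain encoding. This is only a sketch
and not the parquet-cpp implementation: `SimulatePageEncodings`,
`values_per_page`, and `dict_byte_limit` are hypothetical names and parameters
made up for the illustration, and the real writer sizes pages and dictionaries
differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model only: this is NOT the parquet-cpp column writer.
enum class PageEncoding { kDictionary, kPlain };

// Returns the encoding picked for each data page of one column chunk.
//   values          - column values in write order
//   values_per_page - hypothetical fixed number of values per page
//   dict_byte_limit - hypothetical cap on the total bytes of distinct values
// Pages are dictionary-encoded while the dictionary stays within the limit.
// The first page that pushes it over the limit, and every page after it,
// is reported as plain-encoded.
std::vector<PageEncoding> SimulatePageEncodings(const std::vector<std::string>& values,
                                                std::size_t values_per_page,
                                                std::size_t dict_byte_limit) {
  std::vector<PageEncoding> pages;
  if (values_per_page == 0) return pages;

  std::set<std::string> dictionary;
  std::size_t dict_bytes = 0;
  bool fell_back = false;

  for (std::size_t start = 0; start < values.size(); start += values_per_page) {
    const std::size_t end = std::min(values.size(), start + values_per_page);
    if (!fell_back) {
      // Add this page's values to the dictionary and track its size.
      for (std::size_t i = start; i < end; ++i) {
        if (dictionary.insert(values[i]).second) dict_bytes += values[i].size();
      }
      if (dict_bytes > dict_byte_limit) fell_back = true;
    }
    pages.push_back(fell_back ? PageEncoding::kPlain : PageEncoding::kDictionary);
  }
  return pages;
}
```

In this model, a column with few unique values stays dictionary-encoded on
every page, while a column with many unique values starts out
dictionary-encoded and switches to plain partway through the chunk. A check
on the chunk's encodings would therefore have to accept a mix of both.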