It sounds like this is a) a lot of work to do initially and b) a lot of work to maintain as the Thrift data structures evolve.
It seems like benchmarking at that granularity might not be worth the hassle. The lower-level microbenchmark you've added looks simpler. It could be very worthwhile to benchmark the combined planning + scheduling process, since that would presumably require less plumbing.

On Fri, Nov 11, 2016 at 4:49 AM, Lars Volker <[email protected]> wrote:

> Hi all,
>
> Here is a change <https://gerrit.cloudera.org/4554> that implements a
> benchmark for SimpleScheduler::ComputeScanRangeAssignment() to address
> IMPALA-4086 <https://issues.cloudera.org/browse/IMPALA-4086>.
>
> I would like to discuss whether it is possible to run the benchmark against
> the Schedule() method instead. This would require changes to the scheduler
> test utility classes in simple-scheduler-test-util.h to create a
> TQueryExecRequest message suitable for calling Schedule().
>
> Currently we compute these fields before calling
> ComputeScanRangeAssignment(), which are basically what is contained in a
> single plan node.
>
> > BackendConfig
> > vector<TScanRangeLocations>
> > vector<TNetworkAddress>
> > TQueryOptions
>
> To build a schedule object we need to build a TQueryExecRequest, which has
> 14 fields. The complex ones are:
>
> > optional Descriptors.TDescriptorTable desc_tbl
> > optional list<Planner.TPlanFragment> fragments
> > optional list<i32> dest_fragment_idx
> > optional map<Types.TPlanNodeId, list<Planner.TScanRangeLocations>> per_node_scan_ranges
> > optional list<TPlanExecInfo> mt_plan_exec_info
> > optional Results.TResultSetMetadata result_set_metadata
> > optional TFinalizeParams finalize_params
> > required ImpalaInternalService.TQueryCtx query_ctx
> > optional string query_plan
> > required list<Types.TNetworkAddress> host_list
> > optional LineageGraph.TLineageGraph lineage_graph
>
> Some of these members have other dependencies, for example the fragments
> have the plan inside, which has all plan nodes:
>
> > TQueryExecRequest:
> >   list<Planner.TPlanFragment> fragments
> >     partition.type
> >     plan.nodes[node_id]
> >       node_id (for dcheck)
> >       node.hdfs_scan_node (can be unset)
> >     idx (for sorting in query-schedule)
> >   TQueryCtx query_ctx (only for query options, which we already have)
>
> I think it makes sense to benchmark ComputeScanRangeAssignment() in
> isolation, since its implementation is reasonably complex, i.e. not just
> linear in the input size. In order to benchmark Schedule(), we should first
> consider writing proper unit tests for the SimpleScheduler and extend the
> test utility code where necessary to do so.
>
> I'm curious for any feedback. Thanks, Lars
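
To make the "benchmark it in isolation" option concrete, here is a rough sketch of timing an assignment routine on its own at a chosen input size. Everything in it is a hypothetical stand-in: ScanRange, MakeRanges(), AssignScanRanges() and BenchmarkNsPerCall() are illustrative names, the greedy policy is a toy, and none of it is Impala's scheduler code or its benchmark utilities. The point is only that such a benchmark needs a handful of plain inputs rather than a fully populated TQueryExecRequest.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a scan range; not Impala's TScanRangeLocations.
struct ScanRange {
  int64_t length_bytes;
  std::vector<int> replica_hosts;  // indexes of hosts that hold a replica
};

constexpr int64_t kRangeBytes = 64LL * 1024 * 1024;

// Builds deterministic input: range i has replicas on hosts i, i+1, i+2
// (modulo num_hosts), capped at num_hosts replicas.
std::vector<ScanRange> MakeRanges(int num_ranges, int num_hosts) {
  assert(num_ranges >= 0 && num_hosts > 0);
  const int num_replicas = num_hosts < 3 ? num_hosts : 3;
  std::vector<ScanRange> ranges;
  ranges.reserve(num_ranges);
  for (int i = 0; i < num_ranges; ++i) {
    ScanRange r{kRangeBytes, {}};
    for (int k = 0; k < num_replicas; ++k) {
      r.replica_hosts.push_back((i + k) % num_hosts);
    }
    ranges.push_back(r);
  }
  return ranges;
}

// Toy assignment policy (an assumption, not the real algorithm): give each
// range to the replica host with the fewest bytes assigned so far.
// Returns the number of bytes assigned to each host.
std::vector<int64_t> AssignScanRanges(const std::vector<ScanRange>& ranges,
                                      int num_hosts) {
  assert(num_hosts > 0);
  std::vector<int64_t> assigned(num_hosts, 0);
  for (const ScanRange& r : ranges) {
    int best = r.replica_hosts.front();
    for (int h : r.replica_hosts) {
      if (assigned[h] < assigned[best]) best = h;
    }
    assigned[best] += r.length_bytes;
  }
  return assigned;
}

// Returns the average wall-clock time per call in nanoseconds. Input
// construction happens outside the timed region.
double BenchmarkNsPerCall(int num_ranges, int num_hosts, int iterations) {
  assert(iterations > 0);
  const std::vector<ScanRange> ranges = MakeRanges(num_ranges, num_hosts);
  volatile int64_t sink = 0;  // keeps the compiler from dropping the calls
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    const std::vector<int64_t> assigned = AssignScanRanges(ranges, num_hosts);
    sink = sink + assigned[0];
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / iterations;
}
```

Calling BenchmarkNsPerCall() with a fixed host count and, say, 1000, 10000 and 100000 ranges, then comparing the per-call times, would show whether the cost grows faster than linearly with the input size.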
