Thank you, Tim, for your response. After talking to Marcel in person, I filed https://issues.cloudera.org/browse/IMPALA-4496 to track this effort separately.

On Tue, Nov 15, 2016 at 4:14 PM, Tim Armstrong <[email protected]> wrote:

> It sounds like this is a) a lot of work to do initially and b) a lot of
> work to maintain as the thrift data structures evolve.
>
> It seems like benchmarking at that granularity might not be worth the
> hassle. It sounds like the lower-level microbenchmark you've added is maybe
> simpler.
>
> It could be very worthwhile to benchmark the combined planning + scheduling
> process, since that would presumably require less plumbing.
>
> On Fri, Nov 11, 2016 at 4:49 AM, Lars Volker <[email protected]> wrote:
>
> > Hi all,
> >
> > Here is a change <https://gerrit.cloudera.org/4554> that implements a
> > benchmark for SimpleScheduler::ComputeScanRangeAssignment() to address
> > IMPALA-4086 <https://issues.cloudera.org/browse/IMPALA-4086>.
> >
> > I would like to discuss whether it is possible to run the benchmark
> > against the Schedule() method instead. This would require changes to the
> > scheduler test utility classes in simple-scheduler-test-util.h to create a
> > TQueryExecRequest message suitable for calling Schedule().
> >
> > Currently we compute these fields before calling
> > ComputeScanRangeAssignment(), which are basically what is contained in a
> > single plan node.
> >
> > > BackendConfig
> > > vector<TScanRangeLocations>
> > > vector<TNetworkAddress>
> > > TQueryOptions
> >
> > To build a schedule object we need to build a TQueryExecRequest, which
> > has 14 fields. The complex ones are:
> >
> > > optional Descriptors.TDescriptorTable desc_tbl
> > > optional list<Planner.TPlanFragment> fragments
> > > optional list<i32> dest_fragment_idx
> > > optional map<Types.TPlanNodeId, list<Planner.TScanRangeLocations>> per_node_scan_ranges
> > > optional list<TPlanExecInfo> mt_plan_exec_info
> > > optional Results.TResultSetMetadata result_set_metadata
> > > optional TFinalizeParams finalize_params
> > > required ImpalaInternalService.TQueryCtx query_ctx
> > > optional string query_plan
> > > required list<Types.TNetworkAddress> host_list
> > > optional LineageGraph.TLineageGraph lineage_graph
> >
> > Some of these members have other dependencies, for example the fragments
> > have the plan inside, which has all plan nodes:
> >
> > > TQueryExecRequest:
> > >   list<Planner.TPlanFragment> fragments
> > >     partition.type
> > >     plan.nodes[node_id]
> > >       node_id (for dcheck)
> > >       node.hdfs_scan_node (can be unset)
> > >     idx (for sorting in query-schedule)
> > >   TQueryCtx query_ctx (only for query options, which we already have)
> >
> > I think it makes sense to benchmark ComputeScanRangeAssignment() in
> > isolation, since its implementation is reasonably complex, i.e. not just
> > linear in the input size. In order to benchmark Schedule(), we should
> > first consider writing proper unit tests for the SimpleScheduler and
> > extend the test utility code where necessary to do so.
> >
> > I am curious for any feedback. Thanks, Lars
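
To give a rough idea of the plumbing discussed in the quoted message, here is a minimal sketch of the nested structure a benchmark would have to populate before it could call Schedule(). It is only an illustration: the structs are simplified, hypothetical stand-ins for the Thrift-generated types (the real ones have many more fields), `BuildMinimalRequest` is an invented helper that does not exist in simple-scheduler-test-util.h, and the host string is a placeholder.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Thrift-generated types. Only the members
// mentioned in the thread are modeled; everything else is left out.
struct ScanRangeLocations {        // stands in for TScanRangeLocations
  int64_t length = 0;
};

struct PlanNode {
  int32_t node_id = 0;             // expected to match its index in plan.nodes
  bool has_hdfs_scan_node = false; // node.hdfs_scan_node can be unset
};

struct Plan {
  std::vector<PlanNode> nodes;
};

struct PlanFragment {              // stands in for TPlanFragment
  int32_t idx = 0;                 // used for sorting in query-schedule
  int32_t partition_type = 0;      // stands in for partition.type
  Plan plan;
};

struct QueryExecRequest {          // stands in for TQueryExecRequest
  std::vector<PlanFragment> fragments;
  std::map<int32_t, std::vector<ScanRangeLocations>> per_node_scan_ranges;
  std::vector<std::string> host_list;  // simplified: "host:port" strings
};

// Hypothetical helper: builds a request with a single fragment that holds
// `num_scan_nodes` scan nodes, each with `ranges_per_node` scan ranges.
QueryExecRequest BuildMinimalRequest(int num_scan_nodes, int ranges_per_node) {
  assert(num_scan_nodes >= 0 && ranges_per_node >= 0);
  QueryExecRequest req;
  PlanFragment fragment;
  fragment.idx = 0;
  for (int i = 0; i < num_scan_nodes; ++i) {
    PlanNode node;
    node.node_id = i;              // keep node_id equal to the vector index
    node.has_hdfs_scan_node = true;
    fragment.plan.nodes.push_back(node);
    req.per_node_scan_ranges[i] =
        std::vector<ScanRangeLocations>(ranges_per_node);
  }
  req.fragments.push_back(fragment);
  req.host_list.push_back("localhost:22000");  // placeholder address
  return req;
}
```

Even in this stripped-down form, the fragment, plan, node and scan-range maps all have to stay consistent with each other, which is the maintenance cost Tim mentions. Calling `BuildMinimalRequest(3, 4)` would yield one fragment with three scan nodes and four scan ranges per node.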
