It sounds like this is a) a lot of work to do initially and b) a lot of
work to maintain as the thrift data structures evolve.

It seems like benchmarking at that granularity might not be worth the
hassle. It sounds like the lower-level microbenchmark you've added is maybe
simpler.

It could be very worthwhile to benchmark the combined planning + scheduling
process, since that would presumably require less plumbing.

On Fri, Nov 11, 2016 at 4:49 AM, Lars Volker <[email protected]> wrote:

> Hi all,
>
> Here is a change <https://gerrit.cloudera.org/4554> that implements a
> benchmark for SimpleScheduler::ComputeScanRangeAssignment() to address
> IMPALA-4086 <https://issues.cloudera.org/browse/IMPALA-4086>.
>
> I would like to discuss whether it is possible to run the benchmark against
> the Schedule() method instead. This would require changes to the scheduler
> test utility classes in simple-scheduler-test-util.h to create a
> TQueryExecRequest message suitable for calling Schedule().
>
> Currently we compute the following fields before calling
> ComputeScanRangeAssignment(); they are basically what is contained in a
> single plan node.
>
>   BackendConfig
>   vector<TScanRangeLocations>
>   vector<TNetworkAddress>
>   TQueryOptions
>
>
> To build a schedule object we need to build a TQueryExecRequest, which has
> 14 fields. The complex ones are:
>
>   optional Descriptors.TDescriptorTable desc_tbl
>   optional list<Planner.TPlanFragment> fragments
>   optional list<i32> dest_fragment_idx
>   optional map<Types.TPlanNodeId, list<Planner.TScanRangeLocations>> per_node_scan_ranges
>   optional list<TPlanExecInfo> mt_plan_exec_info
>   optional Results.TResultSetMetadata result_set_metadata
>   optional TFinalizeParams finalize_params
>   required ImpalaInternalService.TQueryCtx query_ctx
>   optional string query_plan
>   required list<Types.TNetworkAddress> host_list
>   optional LineageGraph.TLineageGraph lineage_graph
>
>
> Some of these members have other dependencies, for example the fragments
> have the plan inside, which has all plan nodes:
>
> TQueryExecRequest:
>   list<Planner.TPlanFragment> fragments
>     partition.type
>     plan.nodes[node_id]
>       node_id (for dcheck)
>       node.hdfs_scan_node (can be unset)
>     idx (for sorting in query-schedule)
>   TQueryCtx query_ctx (only for query options, which we already have)
>
>
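
To make the outline above concrete, here is a rough sketch of the extra
plumbing a test utility would need. All types and function names below are
hypothetical stand-ins written for illustration; they are not the
Thrift-generated classes and not existing code in
simple-scheduler-test-util.h. Only the members named in the outline are
modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-ins for the Thrift types named in the outline.
struct ScanRangeLocations {};  // contents omitted; only the count matters here
enum class PartitionType { UNPARTITIONED, RANDOM };

struct PlanNode {
  int32_t node_id = 0;              // must match the node's position in the plan
  bool has_hdfs_scan_node = false;  // models the optional hdfs_scan_node member
};

struct PlanFragment {
  int32_t idx = 0;  // used for sorting fragments
  PartitionType partition_type = PartitionType::UNPARTITIONED;
  std::vector<PlanNode> plan_nodes;
};

struct QueryExecRequest {
  std::vector<PlanFragment> fragments;
  std::map<int32_t, std::vector<ScanRangeLocations>> per_node_scan_ranges;
};

// Builds one fragment holding `num_scan_nodes` scan nodes, each with
// `ranges_per_node` scan ranges registered under its node id.
QueryExecRequest MakeMinimalRequest(int num_scan_nodes, int ranges_per_node) {
  QueryExecRequest req;
  PlanFragment fragment;
  fragment.idx = 0;
  fragment.partition_type = PartitionType::RANDOM;
  for (int i = 0; i < num_scan_nodes; ++i) {
    PlanNode node;
    node.node_id = i;
    node.has_hdfs_scan_node = true;
    fragment.plan_nodes.push_back(node);
    req.per_node_scan_ranges[i].resize(ranges_per_node);
  }
  req.fragments.push_back(fragment);
  return req;
}

// Mirrors the "node_id (for dcheck)" note: every node id equals its index.
void CheckNodeIds(const QueryExecRequest& req) {
  for (const PlanFragment& fragment : req.fragments) {
    for (std::size_t i = 0; i < fragment.plan_nodes.size(); ++i) {
      assert(fragment.plan_nodes[i].node_id == static_cast<int32_t>(i));
    }
  }
}
```

Even this reduced version has to keep the node ids, the fragment index and
the per-node scan range map consistent with each other, which is the
maintenance cost discussed in this thread.
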
> I think it makes sense to benchmark ComputeScanRangeAssignment() in
> isolation, since its implementation is reasonably complex, i.e. not just
> linear in the input size. In order to benchmark Schedule(), we should first
> consider writing proper unit tests for the SimpleScheduler and extend the
> test utility code where necessary to do so.
>
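
As a minimal illustration of why a "not just linear" routine is worth
measuring in isolation, the sketch below times a workload at several input
sizes and reports the cost per input item; if that number grows with the
input size, the routine scales worse than linearly. MeasureScaling and
ToyAssign are made-up names, and ToyAssign is a deliberately simple
placeholder, not the algorithm used by ComputeScanRangeAssignment().

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct Sample {
  int64_t input_size;
  double nanos_per_item;
};

// Runs work(n) `repetitions` times for each size and records the average
// wall-clock time per input item.
std::vector<Sample> MeasureScaling(const std::function<void(int64_t)>& work,
                                   const std::vector<int64_t>& sizes,
                                   int repetitions) {
  assert(repetitions > 0);
  std::vector<Sample> samples;
  for (int64_t n : sizes) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) work(n);
    auto elapsed = std::chrono::steady_clock::now() - start;
    double nanos = std::chrono::duration<double, std::nano>(elapsed).count();
    samples.push_back({n, nanos / (static_cast<double>(repetitions) * n)});
  }
  return samples;
}

// Placeholder workload (an assumption for this sketch): assigns each range to
// the least-loaded of num_ranges / 10 + 1 backends using a linear scan, so
// the total cost grows roughly quadratically with num_ranges.
int64_t ToyAssign(int64_t num_ranges) {
  std::vector<int64_t> load(static_cast<std::size_t>(num_ranges / 10 + 1), 0);
  for (int64_t r = 0; r < num_ranges; ++r) {
    std::size_t best = 0;
    for (std::size_t b = 1; b < load.size(); ++b) {
      if (load[b] < load[best]) best = b;
    }
    ++load[best];
  }
  return load[0];  // number of ranges assigned to the first backend
}
```

Calling MeasureScaling with a lambda that wraps ToyAssign and a list of
doubling sizes should show the per-item cost rising with the size. Note that
an optimizing compiler may elide work whose result is unused, so a real
benchmark would need to consume the return value.
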
> I am curious to hear any feedback. Thanks, Lars
>
