Re: How to proceed with IMPALA-4086 (Benchmark for SimpleScheduler)

Lars Volker Wed, 16 Nov 2016 11:36:46 -0800

Thank you Tim for your response. After talking to Marcel in person I filed
https://issues.cloudera.org/browse/IMPALA-4496 to track this effort
separately.


On Tue, Nov 15, 2016 at 4:14 PM, Tim Armstrong <[email protected]>
wrote:

> It sounds like this is a) a lot of work to do initially and b) a lot of
> work to maintain as the thrift data structures evolve.
>
> It seems like benchmarking at that granularity might not be worth the
> hassle. It sounds like the lower-level microbenchmark you've added is maybe
> simpler.
>
> It could be very worthwhile to benchmark the combined planning + scheduling
> process, since that would presumably require less plumbing.
>
> On Fri, Nov 11, 2016 at 4:49 AM, Lars Volker <[email protected]> wrote:
>
> > Hi all,
> >
> > Here is a change <https://gerrit.cloudera.org/4554> that implements a
> > benchmark for SimpleScheduler::ComputeScanRangeAssignment() to address
> > IMPALA-4086 <https://issues.cloudera.org/browse/IMPALA-4086>.
> >
> > I would like to discuss whether it is possible to run the benchmark against
> > the Schedule() method instead. This would require changes to the scheduler
> > test utility classes in simple-scheduler-test-util.h to create a
> > TQueryExecRequest message suitable for calling Schedule().
> >
> > Currently we compute the following fields before calling
> > ComputeScanRangeAssignment(); they are basically what is contained in a
> > single plan node:
> >
> > > BackendConfig
> > > vector<TScanRangeLocations>
> > > vector<TNetworkAddress>
> > > TQueryOptions
> >
> >
> > To build a schedule object we need to build a TQueryExecRequest, which has
> > 14 fields. The complex ones are:
> >
> > > optional Descriptors.TDescriptorTable desc_tbl
> > > optional list<Planner.TPlanFragment> fragments
> > > optional list<i32> dest_fragment_idx
> > > optional map<Types.TPlanNodeId, list<Planner.TScanRangeLocations>>
> > >     per_node_scan_ranges
> > > optional list<TPlanExecInfo> mt_plan_exec_info
> > > optional Results.TResultSetMetadata result_set_metadata
> > > optional TFinalizeParams finalize_params
> > > required ImpalaInternalService.TQueryCtx query_ctx
> > > optional string query_plan
> > > required list<Types.TNetworkAddress> host_list
> > > optional LineageGraph.TLineageGraph lineage_graph
> >
> >
> > Some of these members have other dependencies, for example the fragments
> > have the plan inside, which has all plan nodes:
> >
> > TQueryExecRequest:
> > >  list<Planner.TPlanFragment> fragments
> > >   partition.type
> > >   plan.nodes[node_id]
> > >    node_id (for dcheck)
> > >    node.hdfs_scan_node (can be unset)
> > >   idx (for sorting in query-schedule)
> > >  TQueryCtx query_ctx (only for query options, which we already have)
> >
> >
> > I think it makes sense to benchmark ComputeScanRangeAssignment() in
> > isolation, since its implementation is reasonably complex, i.e. not just
> > linear in the input size. In order to benchmark Schedule(), we should first
> > consider writing proper unit tests for the SimpleScheduler and extending the
> > test utility code where necessary to do so.
> >
> > I am curious to hear any feedback. Thanks, Lars
> >
>
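To make the plumbing cost discussed in the thread concrete, here is a minimal sketch of what a test utility would have to populate before Schedule() could be called. All types and the BuildSketchRequest() helper are hypothetical stand-ins, not the Thrift-generated classes or anything in simple-scheduler-test-util.h; only the members named in the original message are modelled, and the partition type value is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Thrift-generated types. The real structs
// have many more fields; these model only the members listed in the thread.
struct SketchScanRangeLocations {
  std::string path;
  int64_t offset;
  int64_t length;
};

struct SketchPlanNode {
  int node_id;              // the thread notes this is needed for a DCHECK
  bool has_hdfs_scan_node;  // node.hdfs_scan_node can be unset
};

struct SketchPlanFragment {
  int idx;                            // used for sorting in query-schedule
  std::string partition_type;         // partition.type (placeholder value below)
  std::vector<SketchPlanNode> nodes;  // plan.nodes
};

struct SketchQueryExecRequest {
  std::vector<SketchPlanFragment> fragments;
  std::map<int, std::vector<SketchScanRangeLocations>> per_node_scan_ranges;
  std::vector<std::string> host_list;
};

// Builds a request with one single-node scan fragment per entry of
// ranges_per_node, using the vector index as both node id and fragment idx.
SketchQueryExecRequest BuildSketchRequest(
    const std::vector<std::vector<SketchScanRangeLocations>>& ranges_per_node,
    const std::vector<std::string>& hosts) {
  assert(!hosts.empty());
  SketchQueryExecRequest request;
  request.host_list = hosts;
  for (std::size_t i = 0; i < ranges_per_node.size(); ++i) {
    const int node_id = static_cast<int>(i);
    SketchPlanNode node;
    node.node_id = node_id;
    node.has_hdfs_scan_node = true;
    SketchPlanFragment fragment;
    fragment.idx = node_id;
    fragment.partition_type = "UNPARTITIONED";  // placeholder, not verified
    fragment.nodes.push_back(node);
    request.fragments.push_back(fragment);
    request.per_node_scan_ranges[node_id] = ranges_per_node[i];
  }
  return request;
}
```

Even this reduced form needs three nested structures kept mutually consistent (node ids, fragment indices, and the scan range map keys), which illustrates the maintenance concern raised in the reply.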

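The argument for benchmarking the assignment step in isolation (its cost is not simply linear in the input size) can be illustrated with a small timing harness. This is a sketch under stated assumptions: AssignLeastLoaded() is a toy stand-in that picks the least-loaded backend with a min-heap, not Impala's algorithm, and MeanNanosPerCall() is a hypothetical helper rather than part of Impala's benchmark utilities.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy stand-in for the function under test, NOT Impala's algorithm: assigns
// each scan range to the backend with the fewest bytes assigned so far.
// Cost is O(n log m) for n ranges and m backends.
std::vector<int> AssignLeastLoaded(const std::vector<long>& range_bytes,
                                   int num_backends) {
  assert(num_backends > 0);
  typedef std::pair<long, int> Entry;  // (assigned bytes, backend index)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
  for (int b = 0; b < num_backends; ++b) heap.push(Entry(0L, b));
  std::vector<int> assignment;
  assignment.reserve(range_bytes.size());
  for (std::size_t i = 0; i < range_bytes.size(); ++i) {
    Entry least = heap.top();
    heap.pop();
    assignment.push_back(least.second);
    least.first += range_bytes[i];
    heap.push(least);
  }
  return assignment;
}

// Runs the assignment `iterations` times and returns the mean wall-clock
// time per call in nanoseconds.
double MeanNanosPerCall(const std::vector<long>& range_bytes, int num_backends,
                        int iterations) {
  assert(iterations > 0);
  std::size_t sink = 0;  // consume results so the calls are not optimized away
  const std::chrono::steady_clock::time_point start =
      std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    sink += AssignLeastLoaded(range_bytes, num_backends).size();
  }
  const std::chrono::steady_clock::time_point end =
      std::chrono::steady_clock::now();
  volatile std::size_t keep = sink;
  (void)keep;
  return std::chrono::duration<double, std::nano>(end - start).count() /
         iterations;
}
```

Calling MeanNanosPerCall() with inputs of, say, 1,000, 10,000 and 100,000 ranges and comparing the ratios of the results would show whether the time grows faster than the input. A harness of this shape needs only the inputs of the function under test, which is the practical advantage over building a full request for Schedule().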