Design Question in Expression API

Aljoscha Krettek Thu, 29 Jan 2015 07:49:14 -0800

Hi,
I have to decide whether to expose the implementation or hide it from
the user and would like to hear some opinions about that.


The expression operations operate on DataSet[Row], where Row is
basically a wrapper for an array of elements of different types. The
expression API system keeps track of the names and types of these
fields. Right now, when you have an operation like:

// 'foo and 'bar are Scala symbols
// they refer to fields named foo and
// bar in the input data set
val result = in.select('foo, 'bar)

the result is a DataSet[Row]. This means two things:

1. The user can theoretically do a map operation on this where he
manually accesses row fields, as in:

in.map { row =>
  (row.getField(0).asInstanceOf[Int], row.getField(1).asInstanceOf[String])
}

2. I cannot easily look at the whole structure of a query, because
queries are translated to DataSet operations one expression at a
time. For example:

val result = in1.join(in2).filter(...).select(...)

results in a join operation, followed by a filter operation, followed
by a map operation. If the translation did not happen one operator
at a time, we could combine all the operations into one join
operation. This would mean having a custom optimiser component for the
expression API and bypassing the optimiser component we have for
normal operator data flows.
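
To illustrate what looking at the whole query could buy us, here is a
rough sketch that first builds a tree of logical operators and only then
rewrites join + filter + select into a single join node. Every name in
it (Plan, Source, Join, Filter, Select, FusedJoin, PlanFusion) is
hypothetical and exists only for this example; predicates are plain
strings to keep it short. It is not how the current translation works.

```scala
// Hypothetical logical plan nodes; none of these names come from the real API.
sealed trait Plan
case class Source(name: String) extends Plan
case class Join(left: Plan, right: Plan) extends Plan
case class Filter(input: Plan, predicate: String) extends Plan
case class Select(input: Plan, fields: Seq[String]) extends Plan

// A single join that also carries an optional predicate and projection.
case class FusedJoin(
    left: Plan,
    right: Plan,
    predicate: Option[String],
    fields: Option[Seq[String]]) extends Plan

object PlanFusion {
  // Rewrite bottom-up: a filter or select sitting directly on top of a
  // join is folded into that join instead of becoming its own operator.
  def fuse(plan: Plan): Plan = plan match {
    case Select(input, fields) =>
      fuse(input) match {
        case FusedJoin(l, r, p, None) => FusedJoin(l, r, p, Some(fields))
        case other                    => Select(other, fields)
      }
    case Filter(input, predicate) =>
      fuse(input) match {
        case FusedJoin(l, r, None, None) =>
          FusedJoin(l, r, Some(predicate), None)
        case other => Filter(other, predicate)
      }
    case Join(l, r) => FusedJoin(fuse(l), fuse(r), None, None)
    case other      => other
  }
}
```

With this, a plan shaped like Select(Filter(Join(in1, in2), ...), ...)
collapses into one FusedJoin, which is only possible because the whole
tree is available before anything is translated.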

The question now is: should I expose it as is, i.e. let expression
operations result in DataSet[Row], or should I hide it behind another
type of DataSet (ExpressionDataSet) so that we can later change the
implementation details and perform any magic we want behind the
scenes?
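
To make the second option more concrete, here is a minimal sketch of
such a wrapper. It is only an illustration: Row and ExpressionDataSet
below are simplified stand-ins written for this example (plain in-memory
vectors, field names as strings instead of Scala symbols), and toRows is
a hypothetical name for the single point where rows would be handed back
to the user.

```scala
// Hypothetical stand-in for Row: an untyped array of field values.
final case class Row(fields: Vector[Any]) {
  def getField(i: Int): Any = fields(i)
}

// Wrapper that tracks field names and keeps the underlying rows private,
// so users cannot depend on the physical representation.
final class ExpressionDataSet private (
    val fieldNames: Vector[String],
    private val rows: Vector[Row]) {

  // Project onto the named fields; the result stays wrapped.
  def select(names: String*): ExpressionDataSet = {
    val indices = names.map { n =>
      val i = fieldNames.indexOf(n)
      require(i >= 0, s"unknown field: $n")
      i
    }
    new ExpressionDataSet(
      names.toVector,
      rows.map(r => Row(indices.map(i => r.getField(i)).toVector)))
  }

  // The only place where rows leave the wrapper. Everything before this
  // call could later be replaced by a different translation strategy.
  def toRows: Vector[Row] = rows
}

object ExpressionDataSet {
  def apply(fieldNames: Seq[String], rows: Seq[Row]): ExpressionDataSet =
    new ExpressionDataSet(fieldNames.toVector, rows.toVector)
}
```

Because select returns the wrapper rather than the rows, a map with
manual getField access is only possible after an explicit conversion,
and the translation behind select stays free to change.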

Cheers,
Aljoscha
