Hi, I have to decide whether to expose the implementation or hide it from the user and would like to hear some opinions about that.
The expression operations operate on DataSet[Row], where Row is basically a wrapper for an array of elements of different types. The expression API system keeps track of the names and types of these fields. Right now, when you have an operation like:

    // 'foo and 'bar are Scala symbols
    // they refer to fields named foo and
    // bar in the input data set
    val result = in.select('foo, 'bar)

the result is a DataSet[Row]. This means two things:

1. The user can theoretically do a map operation on this where they manually access row fields, as in:

    in.map { row => (row.getField(0).asInstanceOf[Int], row.getField(1).asInstanceOf[String]) }

2. I cannot easily look at the whole structure of a query, because queries are translated to DataSet operations one expression at a time, i.e.:

    val result = in1.join(in2).filter(...).select(...)

results in a join operation, followed by a filter operation, followed by a map operation. If the translation did not happen one operator at a time, we could combine all the operations into one join operation. This would mean having a custom optimiser component for the expression API and bypassing the optimiser component we have for normal operator data flows.

The question now is: should I expose it as is, i.e. let expression operations result in DataSet[Row], or should I hide it behind another type of DataSet (ExpressionDataSet) so that we can later change the implementation details and perform any magic we want behind the scenes?

Cheers,
Aljoscha
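
P.S. To make the second option more concrete, here is a rough, self-contained sketch of what hiding the implementation could look like. It is only an illustration under assumptions, not the actual implementation: the Plan nodes, fromRows, toRows and the small interpreter are all hypothetical names, and a plain Seq[Row] stands in for a real DataSet. The point it tries to show is that the wrapper records operations as a tree instead of translating each one immediately, so the whole query can be inspected before anything runs.

```scala
// Hypothetical sketch: none of these names are taken from the real code base.
// Row mirrors the wrapper described above: an array of untyped fields.
final case class Row(fields: Array[Any]) {
  def getField(i: Int): Any = fields(i)
}

// Logical plan nodes. Operations are recorded here instead of being
// translated to data set operations one at a time.
sealed trait Plan
final case class Source(names: Seq[String], rows: Seq[Row]) extends Plan
final case class Filter(input: Plan, field: String, pred: Any => Boolean) extends Plan
final case class Select(input: Plan, fields: Seq[String]) extends Plan

// The user-facing type. It offers expression operations only (no map over
// rows), so the plan remains visible as a whole until toRows is called.
final class ExpressionDataSet private (val plan: Plan) {
  def filter(field: Symbol)(pred: Any => Boolean): ExpressionDataSet =
    new ExpressionDataSet(Filter(plan, field.name, pred))

  def select(fields: Symbol*): ExpressionDataSet =
    new ExpressionDataSet(Select(plan, fields.map(_.name)))

  // Explicit exit point back to plain rows.
  def toRows: Seq[Row] = ExpressionDataSet.execute(plan)._2
}

object ExpressionDataSet {
  def fromRows(names: Seq[String], rows: Seq[Row]): ExpressionDataSet =
    new ExpressionDataSet(Source(names, rows))

  // Naive interpreter standing in for the translation step. A custom
  // optimiser could rewrite the plan here before running it.
  // No validation: an unknown field name fails at runtime.
  private def execute(plan: Plan): (Seq[String], Seq[Row]) = plan match {
    case Source(names, rows) =>
      (names, rows)
    case Filter(in, field, pred) =>
      val (names, rows) = execute(in)
      val idx = names.indexOf(field)
      (names, rows.filter(r => pred(r.getField(idx))))
    case Select(in, fields) =>
      val (names, rows) = execute(in)
      val idxs = fields.map(f => names.indexOf(f))
      (fields, rows.map(r => Row(idxs.map(r.getField).toArray)))
  }
}
```

With this shape, in.filter(Symbol("foo"))(_ == 2).select(Symbol("bar")) only builds a Select(Filter(Source(...))) tree, and nothing is evaluated until toRows is called. The sketch uses Symbol("foo") rather than the 'foo literal only so that it compiles on newer Scala versions as well.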