GitHub user rdblue commented on the issue:
https://github.com/apache/spark/pull/22222
@xuanyuanking, while this does remove the hack, it doesn't address the
underlying problem: there is a single RDD that may contain either
`InternalRow` or `ColumnarBatch`, and the generated code knows how to
differentiate between the two and use the RDD's contents correctly.
While this is an improvement because it uses the actual record type of the
RDD, the work that still needs to be done is to update the columnar case so
that it really does return an `RDD[InternalRow]` to anyone that accesses data
through that RDD, and then to update the generated code to detect a data
source RDD and access the underlying `RDD[ColumnarBatch]` directly.
Here's some pseudo-code to demonstrate what I mean. The current code does
something like this with a cast. Your change wouldn't fix the need to cast to
`RDD[ColumnarBatch]`:
```scala
def doExecute(rdd: DataSourceRDD[InternalRow]) { // with your change, DataSourceRDD[_]
  if (rdd.isColumnar) {
    doExecuteColumnarBatch(rdd.asInstanceOf[RDD[ColumnarBatch]])
  } else {
    doExecuteRows(rdd)
  }
}
```
I think that should be changed to something like this, which is type-safe:
```scala
def doExecute(rdd: DataSourceRDD[InternalRow]) {
  if (rdd.isColumnar) {
    doExecuteColumnarBatch(rdd.getColumnBatchRDD)
  } else {
    doExecuteRows(rdd)
  }
}
```
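To make that concrete, here is a rough sketch of how a data source RDD wrapper could carry both views. This is only an illustration, not existing Spark API: the class name, the `rows`/`batches` fields, and the `toRowRDD` helper are all made up. The point is that the columnar handle stays typed, and the row view is derived from the batches instead of relying on a cast.

```scala
import scala.collection.JavaConverters._

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Illustrative wrapper: keeps a typed handle to the batch RDD when the
// scan is columnar, and derives the row view from it instead of casting.
class ScanRDDs(
    rows: Option[RDD[InternalRow]],
    batches: Option[RDD[ColumnarBatch]]) {

  // True when the source produced ColumnarBatch data.
  def isColumnar: Boolean = batches.isDefined

  // Typed accessor that replaces rdd.asInstanceOf[RDD[ColumnarBatch]].
  def getColumnBatchRDD: RDD[ColumnarBatch] =
    batches.getOrElse(
      throw new IllegalStateException("Scan is row-based, not columnar"))

  // Anyone that reads rows still gets an RDD[InternalRow]: a columnar
  // scan is flattened one batch at a time. (A real implementation would
  // need to copy rows here, because batch row iterators reuse a single
  // mutable row object.)
  def toRowRDD: RDD[InternalRow] =
    rows.getOrElse(getColumnBatchRDD.flatMap(_.rowIterator().asScala))
}
```

With something shaped like this, misuse fails fast with a clear error instead of a `ClassCastException` somewhere downstream, and the cast goes away entirely.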