Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    @xuanyuanking, while this does remove the hack, it doesn't address the underlying problem. The problem is that there is a single RDD, which may contain either `InternalRow` or `ColumnarBatch`, and the generated code knows how to differentiate between the two and use the RDD contents correctly.
    
    While this is an improvement because it uses the actual type of the records in the RDD, the work that still needs to be done is to update the columnar case so that it returns an `RDD[InternalRow]` to anyone that accesses data through that RDD, and then to update the generated code to detect a data source RDD and access the underlying `RDD[ColumnarBatch]` directly.
    
    Here's some pseudo-code to demonstrate what I mean. The current code does 
something like this with a cast. Your change wouldn't fix the need to cast to 
`RDD[ColumnarBatch]`:
    ```scala
    // With your change, the parameter would be DataSourceRDD[_]
    def doExecute(rdd: DataSourceRDD[InternalRow]): Unit = {
      if (rdd.isColumnar) {
        doExecuteColumnarBatch(rdd.asInstanceOf[RDD[ColumnarBatch]])
      } else {
        doExecuteRows(rdd)
      }
    }
    ```
    
    I think that should be changed to something like this, which is type-safe:
    ```scala
    def doExecute(rdd: DataSourceRDD[InternalRow]): Unit = {
      if (rdd.isColumnar) {
        doExecuteColumnarBatch(rdd.getColumnBatchRDD)
      } else {
        doExecuteRows(rdd)
      }
    }
    ```
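    
    For illustration, here is one possible shape for that type-safe accessor. This is only a sketch under assumptions: the constructor, the `Option`-based columnar field, and the error message are hypothetical and not the actual `DataSourceRDD` API; `getColumnBatchRDD` is just the accessor named in the pseudo-code above.
    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.vectorized.ColumnarBatch
    
    // Sketch only: a wrapper that records whether the underlying scan is
    // columnar, so callers can recover RDD[ColumnarBatch] without a cast.
    class DataSourceRDD(
        rowRDD: RDD[InternalRow],
        columnarRDD: Option[RDD[ColumnarBatch]]) {
    
      def isColumnar: Boolean = columnarRDD.isDefined
    
      // Type-safe accessor: fails fast with a descriptive error instead
      // of a ClassCastException somewhere downstream.
      def getColumnBatchRDD: RDD[ColumnarBatch] =
        columnarRDD.getOrElse(
          throw new IllegalStateException(
            "getColumnBatchRDD called on a row-based DataSourceRDD"))
    }
    ```
    Keeping the columnar RDD as a typed field means the `asInstanceOf` cast disappears, and misuse fails immediately with a clear message rather than later inside generated code.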

