Github user kiszk commented on the pull request:
https://github.com/apache/spark/pull/11956#issuecomment-215696918
@davies I love your idea of using Apache Arrow or another in-memory format
instead of ```Array[Byte]``` for the DataFrame cache. I plan to create a JIRA
entry for it in the near future.
This idea is related to Part 1 of this PR (getting data from the columnar
storage for the DataFrame cache under
```org.apache.spark.sql.execution.columnar```). Whichever in-memory format we
use, the interface to code generation would be ```ColumnVector```. As you can
see, we have already succeeded in wrapping both the ```Array[Byte]``` and
Parquet representations behind methods such as ```getInt()``` in
```ColumnVector```.
Can we at least target Part 2 of this PR (codegen under
```org.apache.spark.sql.execution```) for 2.0? Part 2 can accept any in-memory
representation through ```ColumnVector```. It is also very small (less than
200 lines) once the code for Part 1 is dropped, so it should be easy to review.
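
To illustrate why the consumer side does not depend on the storage format,
here is a minimal sketch, not Spark's actual code. The names ```IntColumn```,
```ByteArrayIntColumn```, ```IntArrayColumn```, ```sum``` and ```encode``` are
hypothetical; only the ```getInt(rowId)``` accessor shape is borrowed from
```ColumnVector```. The 4-byte little-endian layout is also an assumption made
for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ColumnVectorSketch {
    // Hypothetical stand-in for the accessor interface; NOT Spark's ColumnVector.
    interface IntColumn {
        int getInt(int rowId);
    }

    // Column backed by raw bytes (assumed layout: 4 bytes per value, little-endian).
    static final class ByteArrayIntColumn implements IntColumn {
        private final ByteBuffer buf;

        ByteArrayIntColumn(byte[] bytes) {
            this.buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        }

        @Override
        public int getInt(int rowId) {
            return buf.getInt(rowId * 4); // absolute read at the value's byte offset
        }
    }

    // Column backed by a plain int[]; stands in for any other in-memory format.
    static final class IntArrayColumn implements IntColumn {
        private final int[] values;

        IntArrayColumn(int[] values) {
            this.values = values;
        }

        @Override
        public int getInt(int rowId) {
            return values[rowId];
        }
    }

    // Stand-in for generated code: it only sees the accessor, never the storage.
    static long sum(IntColumn col, int numRows) {
        long total = 0;
        for (int i = 0; i < numRows; i++) {
            total += col.getInt(i);
        }
        return total;
    }

    // Helper that serializes ints into the byte layout assumed above.
    static byte[] encode(int... values) {
        ByteBuffer b = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            b.putInt(v);
        }
        return b.array();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        // The same consumer loop runs over both representations.
        System.out.println(sum(new ByteArrayIntColumn(encode(data)), data.length));
        System.out.println(sum(new IntArrayColumn(data), data.length));
    }
}
```

Swapping the backing format means adding another implementation of the
accessor; the ```sum``` loop, which plays the role of Part 2 here, stays
unchanged.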