Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/1072#discussion_r13736713
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala ---
@@ -55,14 +66,26 @@ private[sql] case class InMemoryColumnarTableScan(
       cached.count()
       cached
     }
+}
+
+private[sql] case class InMemoryColumnarTableScan(
+    attributes: Seq[Attribute],
+    relation: InMemoryRelation)
+  extends LeafNode {
+
+  override def output: Seq[Attribute] = attributes

   override def execute() = {
-    cachedColumnBuffers.mapPartitions { iterator =>
+    relation.cachedColumnBuffers.mapPartitions { iterator =>
       val columnBuffers = iterator.next()
       assert(!iterator.hasNext)

       new Iterator[Row] {
-        val columnAccessors = columnBuffers.map(ColumnAccessor(_))
+        // Find the ordinals of the requested columns. If none are requested, use the first.
+        val requestedColumns =
+          if (attributes.isEmpty) Seq(0) else attributes.map(relation.output.indexOf(_))
--- End diff --
I'm not sure I understand this correctly: is it because we don't know the total
row count that we must scan at least one column here? If so, I think we can
simply record the row count while building the in-memory columnar byte arrays,
so that no column needs to be scanned at all.
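
A minimal sketch of the idea, in case it helps. Everything here (CachedBatch,
buildBatch, scan) is a hypothetical stand-in rather than the actual Spark
types: the row count is recorded once while the column buffers are filled, so
a scan that requests no columns can answer from the stored count alone.

    import java.nio.ByteBuffer

    // Hypothetical stand-in for one partition's cached payload: the column
    // buffers plus the row count recorded at build time.
    case class CachedBatch(columnBuffers: Array[ByteBuffer], rowCount: Int)

    object CountWithoutScan {
      // Build time: count rows while the column buffers are being filled,
      // instead of re-deriving the count later by decoding a column.
      def buildBatch(rows: Seq[Array[Any]], numColumns: Int): CachedBatch = {
        val buffers = Array.fill(numColumns)(ByteBuffer.allocate(1 << 16))
        var count = 0
        rows.foreach { row =>
          count += 1
          // ... append each field of `row` to its column buffer here ...
        }
        CachedBatch(buffers, count)
      }

      // Scan time: if no columns are requested (e.g. a bare COUNT(*)), emit
      // `rowCount` empty rows without touching any column buffer.
      def scan(batch: CachedBatch, requestedOrdinals: Seq[Int]): Iterator[Seq[Any]] =
        if (requestedOrdinals.isEmpty) Iterator.fill(batch.rowCount)(Seq.empty)
        else Iterator.empty // real code would decode the requested columns here
    }

Storing the count next to the buffers costs a few bytes per batch and would
remove the need for the Seq(0) fallback in the diff above.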