Repository: spark
Updated Branches:
  refs/heads/branch-2.2 a902323fb -> 041aec4e1


[SPARK-23963][SQL] Properly handle large number of columns in query on 
text-based Hive table

## What changes were proposed in this pull request?

TableReader would get disproportionately slower as the number of columns in the 
query increased.

I fixed the way TableReader was looking up metadata for each column in the row. 
Previously, it had been looking up this data in linked lists, accessing each 
linked list by an index (column number); since indexed access into a linked 
list is O(n), the per-row cost grew quadratically with the column count. Now it 
looks up this data in arrays, where indexing by column number is O(1).
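The shape of the fix can be sketched as follows. This is an illustrative standalone example, not code from the patch: it mimics the pattern changed in `TableReader.scala`, where `unzip` over a default `Seq` yields linked `List`s, while inserting `.toArray` first yields arrays with constant-time indexing.

```scala
// Illustrative sketch of the List-vs-Array lookup pattern (names are
// hypothetical; only the .toArray.unzip idiom comes from the actual patch).
object IndexedLookupSketch {
  def main(args: Array[String]): Unit = {
    val numColumns = 1000
    // Stand-in for (fieldRef, ordinal) pairs keyed by column number.
    val pairs = (0 until numColumns).map(i => (s"col$i", i))

    // Before the fix: unzip on a List produces two linked Lists,
    // where apply(i) must walk i cells -> O(n^2) total per row.
    val (namesList, ordinalsList) = pairs.toList.unzip

    // After the fix: .toArray.unzip produces two Arrays,
    // where apply(i) is O(1) -> O(n) total per row.
    val (namesArray, ordinalsArray) = pairs.toArray.unzip

    // Both layouts return the same metadata; only lookup cost differs.
    val viaList  = (0 until numColumns).map(i => namesList(i))
    val viaArray = (0 until numColumns).map(i => namesArray(i))
    assert(viaList == viaArray)
    assert(ordinalsArray(numColumns - 1) == numColumns - 1)
    println(s"lookups agree for $numColumns columns")
  }
}
```

The one-line change in the diff below applies exactly this idea: converting the mapped pairs to an array before `unzip` so that the downstream per-column lookups index into arrays rather than lists.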

## How was this patch tested?

- Manual testing
- All sbt unit tests
- Python SQL tests

Author: Bruce Robbins <[email protected]>

Closes #21043 from bersprockets/tabreadfix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/041aec4e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/041aec4e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/041aec4e

Branch: refs/heads/branch-2.2
Commit: 041aec4e1bfb4f3c2d4db6761486f3523102c75e
Parents: a902323
Author: Bruce Robbins <[email protected]>
Authored: Fri Apr 13 14:05:04 2018 -0700
Committer: gatorsmile <[email protected]>
Committed: Wed Apr 18 09:50:13 2018 -0700

----------------------------------------------------------------------
 .../src/main/scala/org/apache/spark/sql/hive/TableReader.scala     | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/041aec4e/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
----------------------------------------------------------------------
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
index a0e379f..11795ff 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
@@ -381,7 +381,7 @@ private[hive] object HadoopTableReader extends 
HiveInspectors with Logging {
 
     val (fieldRefs, fieldOrdinals) = nonPartitionKeyAttrs.map { case (attr, 
ordinal) =>
       soi.getStructFieldRef(attr.name) -> ordinal
-    }.unzip
+    }.toArray.unzip
 
     /**
      * Builds specific unwrappers ahead of time according to object inspector


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
