Github user jose-torres commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r178317653
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -187,6 +189,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
          "read files of Hive data source directly.")
    }
+    // SPARK-23817 Since datasource V2 didn't support reading multiple files yet,
+    // ORC V2 is only used when loading single file path.
+    val allPaths = CaseInsensitiveMap(extraOptions.toMap).get("path") ++ paths
+    val orcV2 = OrcDataSourceV2.satisfy(sparkSession, source, allPaths.toSeq)
+    if (orcV2.isDefined) {
+      option("path", allPaths.head)
+      source = orcV2.get
+    }
--- End diff --
What about bucketed reads? Will they need a similar change here, or is that
lack of support handled elsewhere? (Or am I misunderstanding something about
that part of the description; I'm not super familiar with the ORC source.)
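For readers less familiar with the idiom in the diff, here is a minimal sketch (with hypothetical file paths) of how `allPaths` is assembled: `get("path")` yields an `Option[String]`, and `++` with the `paths` varargs collects every candidate path into one sequence, which is why a single-element result signals the single-file case.

```scala
// Hypothetical illustration of the `allPaths` construction in the diff above.
// An Option[String] concatenated with a Seq[String] via ++ produces an
// Iterable containing the option's value (if present) followed by the paths.
object AllPathsSketch {
  def main(args: Array[String]): Unit = {
    // Stand-ins for extraOptions' "path" entry and the `paths` varargs
    // (hypothetical values, not from the PR).
    val pathOption: Option[String] = Some("/data/a.orc")
    val paths: Seq[String] = Seq("/data/b.orc")

    val allPaths = (pathOption ++ paths).toSeq
    println(allPaths) // List(/data/a.orc, /data/b.orc)
    // Per the comment in the diff, ORC V2 would only be selected when
    // allPaths contains exactly one path.
    println(allPaths.size == 1) // false
  }
}
```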
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]