Github user jose-torres commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r178234982
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -187,6 +189,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
         "read files of Hive data source directly.")
    }
+    // SPARK-23817: since data source V2 doesn't support reading multiple files yet,
+    // ORC V2 is only used when loading a single file path.
+    val allPaths = CaseInsensitiveMap(extraOptions.toMap).get("path") ++ paths
+    val orcV2 = OrcDataSourceV2.satisfy(sparkSession, source, allPaths.toSeq)
+    if (orcV2.isDefined) {
+      option("path", allPaths.head)
+      source = orcV2.get
+    }
--- End diff ---
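
For readers skimming the thread, a minimal standalone sketch of the gating this diff appears to add. `OrcDataSourceV2.satisfy` comes from the diff, but its body is not shown here, so the condition below (source is "orc" and exactly one path) is an assumption based on the code comment, and all names are illustrative rather than the actual Spark internals:

    // Hypothetical, self-contained stand-in for the gating in the diff above.
    // Assumption: satisfy returns a V2 source name only for "orc" with exactly one path.
    object OrcV2GateSketch {
      def satisfy(source: String, allPaths: Seq[String]): Option[String] =
        if (source.equalsIgnoreCase("orc") && allPaths.size == 1)
          Some("orcV2Placeholder") // placeholder, not the real V2 provider class
        else
          None

      // Mirrors the DataFrameReader snippet: merge the "path" option with the load()
      // paths (the real code does this case-insensitively via CaseInsensitiveMap),
      // then swap the source only when the single-path condition holds.
      def resolveSource(source: String, options: Map[String, String], paths: Seq[String]): String = {
        val pathOption = options.collectFirst { case (k, v) if k.equalsIgnoreCase("path") => v }
        val allPaths = pathOption.toSeq ++ paths
        satisfy(source, allPaths).getOrElse(source)
      }
    }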
It seems weird that DataFrameReader is modified here. Will DataSourceV2
implementations generally need to modify DataFrameReader, or is it just a
temporary hack because of the mentioned lack of support? In the latter case, is
there a plan to add this support soon?
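
To make the scope of the special case concrete, a hedged usage sketch (e.g. in spark-shell; the file paths are made up, and the behavior assumes the single-path gating described in the diff):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-v2-gate-example").master("local[*]").getOrCreate()

    // Single path: under the diff above, the source would be rewritten to ORC V2.
    val single = spark.read.format("orc").load("/tmp/data/part-00000.orc")

    // Multiple paths: data source V2 can't read them yet, so this stays on the V1 ORC reader.
    val multi = spark.read.format("orc").load("/tmp/data/a.orc", "/tmp/data/b.orc")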