Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r156472027
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
```diff
@@ -139,15 +146,25 @@ class OrcFileFormat
       }
     }
+    val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
+    val enableVectorizedReader = sparkSession.sessionState.conf.orcVectorizedReaderEnabled &&
+      supportBatch(sparkSession, resultSchema)
+
     val broadcastedConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     val isCaseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
     (file: PartitionedFile) => {
       val conf = broadcastedConf.value.value
+      val filePath = new Path(new URI(file.filePath))
+
+      val fs = filePath.getFileSystem(conf)
+      val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
+      val reader = OrcFile.createReader(filePath, readerOptions)
```
--- End diff --

The reader is used here, too, so extracting it into a single `val` avoids creating the same reader twice per file.
```scala
batchReader.setRequiredSchema(
OrcUtils.getFixedTypeDescription(reader.getSchema, dataSchema),
```
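To illustrate the point of the comment outside of Spark, here is a minimal, self-contained Scala sketch of the refactor being suggested: hoisting an expensive-to-create reader out of its two use sites so it is constructed once and shared. `CountingReader` is a hypothetical stand-in for `OrcFile.createReader`, which opens the file and reads its footer on construction; the names and counter are illustrative only, not Spark or ORC APIs.

```scala
// Hypothetical stand-in for an ORC reader: construction is the costly part
// (in the real code it opens the file and parses the footer), so we count
// how many times it happens.
object ReaderReuseSketch {
  final class CountingReader(val path: String) {
    CountingReader.creations += 1              // models the costly file open
    def schema: String = s"schema-of-$path"
  }
  object CountingReader {
    var creations = 0
  }

  // Before the refactor: each use site builds its own reader.
  def schemasWithoutExtraction(path: String): (String, String) =
    (new CountingReader(path).schema, new CountingReader(path).schema)

  // After the refactor: one reader, created exactly once, serves both uses.
  def schemasWithExtraction(path: String): (String, String) = {
    val reader = new CountingReader(path)
    (reader.schema, reader.schema)
  }

  def main(args: Array[String]): Unit = {
    schemasWithoutExtraction("part-0.orc")
    println(s"without extraction: ${CountingReader.creations} readers built")
    CountingReader.creations = 0
    schemasWithExtraction("part-0.orc")
    println(s"with extraction: ${CountingReader.creations} reader built")
  }
}
```

The same reasoning applies in the diff above: once `reader` is bound in the closure, both the schema fix-up and the batch reader can consume it without a second `createReader` call.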
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]