[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

GitBox Fri, 18 Sep 2020 04:52:07 -0700


peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r490895572




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
       reader: Reader,
       conf: Configuration): Option[(Array[Int], Boolean)] = {
     val orcFieldNames = reader.getSchema.getFieldNames.asScala
+    val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
       Sorry, I'm not sure I get your point.
   In `requestedColumnIds()` we map `requiredSchema` to the schema in the ORC 
files (`orcFieldNames`), and Spark is already prepared for the case where the 
schema in HMS and the schema in the file don't match: 
https://github.com/apache/spark/pull/29737/files#diff-3fb8426b690ab771c4f67f9cad336498L149
 (the case where the file was written by an old version of Hive).
   It turns out that a schema mismatch can also happen with newer versions of 
Hive (where the column names in the file don't start with `_col`): simply 
setting `orc.force.positional.evolution=true` and then renaming a column in 
Hive also produces a mismatch in Spark, and in that case Spark currently 
returns `null`s.
   It seemed like a good idea to add support for this setting to our data 
source, but if this is not the right way to deal with the issue, please let me 
know. (I've updated the PR description a bit to make it clearer what I'm 
trying to fix.)
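
   To illustrate the mismatch described above, here is a minimal, 
self-contained sketch. It is not the code in `OrcUtils.requestedColumnIds()`; 
the object `PositionalEvolutionSketch`, the helper `resolveColumnIds`, and its 
parameters are hypothetical names used only for illustration. It assumes a 
file written with columns (`id`, `name`) whose second column was later renamed 
to `full_name` in Hive.

```scala
object PositionalEvolutionSketch {
  // Hypothetical helper (not Spark's API): for each required field, return the
  // index of the ORC file column to read, or -1 if no column can be matched.
  def resolveColumnIds(
      orcFieldNames: Seq[String],      // column names stored in the ORC file
      tableFieldNames: Seq[String],    // column names in the metastore schema
      requiredFieldNames: Seq[String], // columns the query actually needs
      forcePositionalEvolution: Boolean): Seq[Int] = {
    if (forcePositionalEvolution) {
      // Match by position: a required field reads the file column that sits at
      // the same index as the field has in the table schema.
      requiredFieldNames.map { name =>
        val pos = tableFieldNames.indexOf(name)
        if (pos >= 0 && pos < orcFieldNames.length) pos else -1
      }
    } else {
      // Match by name: a column renamed after the file was written is not
      // found, which is the situation where the reader ends up with nulls.
      requiredFieldNames.map(name => orcFieldNames.indexOf(name))
    }
  }
}
```

   With name-based matching, `full_name` is not found in the file (index 
`-1`); with positional matching, it resolves to the second file column 
(index `1`).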




