[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

viirya Thu, 23 Aug 2018 15:24:02 -0700

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212475271
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
    @@ -79,9 +79,10 @@ object OrcUtils extends Logging {
         val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
         val conf = sparkSession.sessionState.newHadoopConf()
         // TODO: We need to support merge schema. Please see SPARK-11412.
    -    files.map(_.getPath).flatMap(readSchema(_, conf, ignoreCorruptFiles)).headOption.map { schema =>
    -      logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    -      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    +    files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
    +      case Some(schema) =>
    +        logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    +        CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    --- End diff --
    
    Yeah, it is only ignored while reading the schema.
    
    What changes is the timing of when corrupt files are detected. Detection
    is now postponed until the file contents are actually read.
    
    That might not be a big deal, though for the user experience it is better
    to throw such an exception early.
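
    To illustrate the timing difference, here is a minimal sketch in plain
    Scala with no Spark or ORC dependencies. The `readSchema` stub, the
    `opened` buffer, and the file names are hypothetical stand-ins, not the
    real `OrcUtils` API; the stub simply treats any path starting with "bad"
    as unreadable and returns `None` for it.

```scala
import scala.collection.mutable.ArrayBuffer

object LazySchemaLookup {
  // Hypothetical stand-in for a schema reader: records every file it opens
  // and returns None for paths this sketch treats as corrupt.
  def readSchema(path: String, opened: ArrayBuffer[String]): Option[String] = {
    opened += path
    if (path.startsWith("bad")) None else Some(s"struct<from:$path>")
  }

  // Eager shape (like the removed lines): flatMap on a strict Seq opens
  // every file before headOption picks the first result.
  def eagerFirst(files: Seq[String], opened: ArrayBuffer[String]): Option[String] =
    files.flatMap(readSchema(_, opened)).headOption

  // Lazy shape (like the added lines): the iterator is consumed only until
  // collectFirst sees the first Some(schema).
  def lazyFirst(files: Seq[String], opened: ArrayBuffer[String]): Option[String] =
    files.iterator.map(readSchema(_, opened)).collectFirst { case Some(s) => s }
}
```

    With the files `bad1, good1, bad2, good2`, both variants return the schema
    from `good1`, but the eager one opens all four files while the lazy one
    stops after `good1`. In this sketch `bad2` is never touched during schema
    inference, so any problem with it would surface only later, when its
    contents are read.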



