[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

dongjoon-hyun Thu, 23 Aug 2018 09:26:31 -0700

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212373914
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
    @@ -79,9 +79,10 @@ object OrcUtils extends Logging {
         val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
         val conf = sparkSession.sessionState.newHadoopConf()
         // TODO: We need to support merge schema. Please see SPARK-11412.
    -    files.map(_.getPath).flatMap(readSchema(_, conf, ignoreCorruptFiles)).headOption.map { schema =>
    -      logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    -      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    +    files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
    +      case Some(schema) =>
    +        logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    +        CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    --- End diff --
    
    @viirya, the corrupt files are not ignored. Spark will throw 
    `SparkException` while reading the content.
    > Now if Orc source reads the first valid schema, it doesn't read other Orc files further. So the corrupt files are ignored when SQLConf.IGNORE_CORRUPT_FILES is false.
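
To make the behavior under discussion concrete, here is a minimal sketch of the difference between the removed eager chain and the added lazy one. It does not use Spark: `LazySchemaDemo`, its `readSchema`, the `opened` counter, and the file names are hypothetical stand-ins for illustration only, and `.iterator` is used in place of `toIterator`.

```scala
object LazySchemaDemo {
  // Hypothetical stand-in for OrcUtils.readSchema (an assumption, not Spark's code):
  // it counts how many files were opened and returns None when no schema is found.
  var opened = 0

  def readSchema(path: String): Option[String] = {
    opened += 1
    if (path.startsWith("empty")) None else Some(s"struct<from:$path>")
  }

  // Eager variant, shaped like the removed lines: every file is opened
  // before headOption picks the first result.
  def eagerFirst(files: Seq[String]): Option[String] =
    files.flatMap(readSchema(_)).headOption

  // Lazy variant, shaped like the added lines: the iterator stops at the
  // first Some, so later files are never opened during schema inference.
  def lazyFirst(files: Seq[String]): Option[String] =
    files.iterator.map(readSchema).collectFirst { case Some(schema) => schema }

  def main(args: Array[String]): Unit = {
    val files = Seq("empty-0.orc", "part-1.orc", "part-2.orc")

    opened = 0
    println(s"eager: ${eagerFirst(files)}, opened $opened file(s)")

    opened = 0
    println(s"lazy: ${lazyFirst(files)}, opened $opened file(s)")
  }
}
```

With these inputs the eager variant opens all three files and the lazy variant opens two. The sketch only covers schema inference; files skipped at this stage are still read when the data itself is scanned, which is where the `SparkException` mentioned above would surface.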



