[GitHub] spark issue #14509: [SPARK-16924][SQL] - Support option("inferSchema", true)...

xwu0226 Sun, 07 Aug 2016 23:37:33 -0700

Github user xwu0226 commented on the issue:

    https://github.com/apache/spark/pull/14509
  
    @HyukjinKwon Thank you for reviewing and pointing this out.
    Looking deeper into the code, I agree with you. Even a CSV file can have a
    default schema with StringType for each column, which means it requires
    neither schema inference nor a `userSpecifiedSchema`. The JSON file format
    can infer its schema automatically, and the text file format always uses a
    single column "value" of StringType as its schema. So the original error
    thrown in `DataSource.sourceSchema` should not depend on whether
    inferSchema is specified.
    
    Refer to the existing code:
    ```
    val isSchemaInferenceEnabled = sparkSession.conf.get(SQLConf.STREAMING_SCHEMA_INFERENCE)
    val isTextSource = providingClass == classOf[text.TextFileFormat]
    // If the schema inference is disabled, only text sources require schema to be specified
    if (!isSchemaInferenceEnabled && !isTextSource && userSpecifiedSchema.isEmpty) {
      throw new IllegalArgumentException(
        "Schema must be specified when creating a streaming source DataFrame. " +
          "If some files already exist in the directory, then depending on the file format " +
          "you may be able to create a static DataFrame on that directory with " +
          "'spark.read.load(directory)' and infer schema from it.")
    }
    ```
    I think the real problems now are:
    1. Why do we even need to check `inferSchema`, `userSpecifiedSchema`, and
       `isTextSource` in `DataSource.sourceSchema` for the FileFormat case?
    2. The `!isTextSource` condition is wrong; according to the comment, it
       should be `isTextSource`. This is the direct cause of the error, since
       CSV and JSON files are not TextFileFormat. Besides, a text file only has
       the default one-column schema.
    3. To make the SQLConf setting `spark.sql.streaming.schemaInference` or
       `option("inferSchema", true)` effective for CSVFileFormat, we need code
       to handle that in `DataSource.inferFileFormatSchema` for the
       CSVFileFormat case.
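
    To make the behavior of the quoted guard easier to see, it can be modeled
    as a pure predicate. This is only a minimal sketch, not Spark's code:
    `SchemaCheckSketch` and `wouldThrow` are hypothetical names, and
    `hasUserSchema` stands in for `userSpecifiedSchema.isDefined`.

    ```scala
    // Simplified model of the guard quoted above; illustrative only.
    object SchemaCheckSketch {
      /** Returns true when the quoted condition would throw IllegalArgumentException. */
      def wouldThrow(
          isSchemaInferenceEnabled: Boolean,
          isTextSource: Boolean,
          hasUserSchema: Boolean): Boolean =
        !isSchemaInferenceEnabled && !isTextSource && !hasUserSchema

      def main(args: Array[String]): Unit = {
        // CSV or JSON stream, inference disabled, no schema supplied.
        println(wouldThrow(isSchemaInferenceEnabled = false, isTextSource = false, hasUserSchema = false)) // true
        // Text stream, inference disabled, no schema supplied.
        println(wouldThrow(isSchemaInferenceEnabled = false, isTextSource = true, hasUserSchema = false)) // false
        // CSV or JSON stream with spark.sql.streaming.schemaInference enabled.
        println(wouldThrow(isSchemaInferenceEnabled = true, isTextSource = false, hasUserSchema = false)) // false
        // CSV or JSON stream with a user-specified schema.
        println(wouldThrow(isSchemaInferenceEnabled = false, isTextSource = false, hasUserSchema = true)) // false
      }
    }
    ```

    With the condition as written, only the first case is rejected: a non-text
    source with neither a user schema nor the SQLConf flag enabled. Note that
    `option("inferSchema", true)` does not appear in the predicate at all.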
    
    What do you think?



