Github user xwu0226 commented on the issue:
https://github.com/apache/spark/pull/14509
@HyukjinKwon Thank you for reviewing and pointing this out.
Looking deeper into the code, I agree with you. Actually, even a CSV file can
have a default schema with StringType for each column, which means it does not
require schema inference or a `userSpecifiedSchema`. The JSON file format can
infer its schema automatically, and the text file format always uses a single
"value" column of StringType as its schema. So the original error thrown in
`DataSource.sourceSchema` should not depend on whether `inferSchema` is
specified.
Refer to the existing code:
```
val isSchemaInferenceEnabled = sparkSession.conf.get(SQLConf.STREAMING_SCHEMA_INFERENCE)
val isTextSource = providingClass == classOf[text.TextFileFormat]
// If the schema inference is disabled, only text sources require schema to be specified
if (!isSchemaInferenceEnabled && !isTextSource && userSpecifiedSchema.isEmpty) {
  throw new IllegalArgumentException(
    "Schema must be specified when creating a streaming source DataFrame. " +
      "If some files already exist in the directory, then depending on the file format " +
      "you may be able to create a static DataFrame on that directory with " +
      "'spark.read.load(directory)' and infer schema from it.")
}
```
I think the real problems now are:
1. Why do we even need to check `inferSchema`, `userSpecifiedSchema`, and
`isTextSource` in `DataSource.sourceSchema` for the FileFormat case?
2. The `!isTextSource` condition is wrong; according to the comment, it should
be `isTextSource`. This is the direct cause of the error, since CSV and JSON
files are not TextFileFormat. Plus, a text file only has the default
one-column schema.
3. For the SQLConf setting "`spark.sql.streaming.schemaInference`" or
`option("inferSchema", true)` to take effect for CSVFileFormat, we need code
to handle that in `DataSource.inferFileFormatSchema` for the CSVFileFormat
case.
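To make point 2 concrete, here is a minimal stand-alone sketch that models only the boolean guard from the quoted snippet, with no Spark dependency. The names `GuardSketch`, `Inputs`, `throwsAsWritten`, and `throwsAsProposed` are hypothetical and exist only for illustration; they are not part of Spark. It compares the condition as written with the un-negated `isTextSource` variant suggested above.

```scala
// Hypothetical stand-alone model of the guard; this is not Spark code.
object GuardSketch {
  // Mirrors the three values consulted by the quoted snippet.
  final case class Inputs(
      inferenceEnabled: Boolean, // spark.sql.streaming.schemaInference
      isTextSource: Boolean,     // providingClass is TextFileFormat
      hasUserSchema: Boolean)    // userSpecifiedSchema.nonEmpty

  // The condition exactly as it appears in the quoted code.
  def throwsAsWritten(in: Inputs): Boolean =
    !in.inferenceEnabled && !in.isTextSource && !in.hasUserSchema

  // The variant with `isTextSource` un-negated, as suggested in point 2.
  def throwsAsProposed(in: Inputs): Boolean =
    !in.inferenceEnabled && in.isTextSource && !in.hasUserSchema

  def main(args: Array[String]): Unit = {
    val bools = Seq(false, true)
    // Enumerate all eight input combinations and print both outcomes.
    for (inf <- bools; text <- bools; schema <- bools) {
      val in = Inputs(inf, text, schema)
      println(s"$in asWritten=${throwsAsWritten(in)} asProposed=${throwsAsProposed(in)}")
    }
  }
}
```

With inference disabled and no user schema, the as-written guard rejects every non-text source (CSV, JSON) and lets text through, while the un-negated variant does the opposite. This only illustrates the two readings; it does not settle which one `DataSource.sourceSchema` should use.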
What do you think?