[ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211804#comment-15211804 ]
Michel Lemay commented on SPARK-12436:
--------------------------------------

I'm working on a new PR for this issue since the other one is inactive. However, I'm puzzled to see this in the code:

{code}
case VALUE_STRING if parser.getTextLength < 1 =>
  // Zero length strings and nulls have special handling to deal
  // with JSON generators that do not distinguish between the two.
  // To accurately infer types for empty strings that are really
  // meant to represent nulls we assume that the two are isomorphic
  // but will defer treating null fields as strings until all the
  // record fields' types have been combined.
  NullType
{code}

With the modification proposed in canonicalizeType to remove the case NullType => Some(StringType), we will end up with NullType for what is explicitly an empty string. For instance:

{ "f1": "" }

will have the schema:

StructType(StructField("f1", NullType, true))

Is that OK?

> If all values of a JSON field is null, JSON's inferSchema should return
> NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12436
>                 URL: https://issues.apache.org/jira/browse/SPARK-12436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: starter
>
> Right now, JSON's inferSchema will return {{StringType}} for a field that
> always has null values or an {{ArrayType(StringType)}} for a field that
> always has empty array values. Although this behavior makes writing JSON data
> to other data sources easy (i.e. when writing data, we do not need to remove
> those {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for
> downstream applications to reason about the actual schema of the data and thus
> makes schema merging hard. We should allow JSON's inferSchema to return
> {{NullType}} and {{ArrayType(NullType)}}.
> Also, we need to make sure that when we write data out, we remove those
> {{NullType}} or {{ArrayType(NullType)}} columns first.
> Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same
> thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields).
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and
> {{ArrayType(NullType)}} columns from the data that will be written out for all
> data sources (i.e. data sources based on our data source API and Hive tables).
> Or, we should just add this operation for certain data sources (e.g.
> Parquet). For example, we may not need this operation for Hive because Hive
> has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.
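
To make the question in the comment concrete, here is a minimal, self-contained Scala sketch. It does not use Spark's classes: everything inside NullInferenceSketch (DataType, NullType, StringType, LongType, JsonValue, inferField, merge, canonicalizeCurrent, canonicalizeProposed, inferColumn) is a hypothetical stand-in that only mimics the rule quoted in the comment (zero-length strings infer as NullType) and the canonicalizeType case under discussion. The fallback to StringType on conflicting types is a simplification for the sketch, not a claim about Spark's actual merge logic.

```scala
object NullInferenceSketch {
  // Hypothetical stand-ins for Spark's type hierarchy (not the real classes).
  sealed trait DataType
  case object NullType extends DataType
  case object StringType extends DataType
  case object LongType extends DataType

  // Simplified JSON scalar values that a single field can hold.
  sealed trait JsonValue
  case object JsonNull extends JsonValue
  final case class JsonString(s: String) extends JsonValue
  final case class JsonLong(n: Long) extends JsonValue

  // Mirrors the quoted rule: nulls and zero-length strings both infer as NullType.
  def inferField(v: JsonValue): DataType = v match {
    case JsonNull                      => NullType
    case JsonString(s) if s.length < 1 => NullType
    case JsonString(_)                 => StringType
    case JsonLong(_)                   => LongType
  }

  // Combine the types seen across records; NullType yields to any concrete type.
  def merge(a: DataType, b: DataType): DataType = (a, b) match {
    case (NullType, t)    => t
    case (t, NullType)    => t
    case (x, y) if x == y => x
    case _                => StringType // simplification: conflicts fall back to string
  }

  // Behavior described in the issue: a leftover NullType becomes StringType.
  def canonicalizeCurrent(t: DataType): Option[DataType] = t match {
    case NullType => Some(StringType)
    case other    => Some(other)
  }

  // Proposed change: drop the NullType => Some(StringType) case.
  def canonicalizeProposed(t: DataType): Option[DataType] = Some(t)

  // Infer the type of one field across all records.
  def inferColumn(values: Seq[JsonValue]): DataType =
    values.map(inferField).foldLeft(NullType: DataType)(merge)
}
```

Under these assumptions, a field whose only value is "" stays NullType once the canonicalization case is removed, which is the situation asked about; a field that mixes "" with a non-empty string still resolves to StringType.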