ksbeyer commented on code in PR #51466: URL: https://github.com/apache/spark/pull/51466#discussion_r2223315147
########## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala: ########## @@ -222,6 +227,37 @@ case class InsertIntoHadoopFsRelationCommand( Seq.empty[Row] } + /** + * The JSON writer [[org.apache.spark.sql.catalyst.json.JacksonGenerator]] has a special feature + * that changes the null handling of top-level columns that have a default value such that a + * explicit null is written. This is detected today by looking for the metadata key + * [[ResolveDefaultColumnsUtils#EXISTS_DEFAULT_COLUMN_METADATA_KEY]] on the query attribute. + * This function copies this key from the table attribute to the query attribute only + * when a table metadata is available, only for JSON output, and only when the configuration + * requests the special feature. + * + * We should instead pass the table description down to the writers instead of using query + * attribute metadata, but this is a nontrivial change. + */ + private def markColumnsWithDefaultForJson(outputColumns: Seq[Attribute]): Seq[Attribute] = { + if (catalogTable.isEmpty || !fileFormat.isInstanceOf[JsonFileFormat] || Review Comment: My goal is to eliminate the propagation. I don't think we should propagate the table attribute metadata to query attributes. I want to limit the propagation for a documented purpose. Is should also file another jira to fix the json writer to get table info and remove this propagation. At a broader scale, the use of schema and attribute metadata seems pretty adhoc and error prone. Eg, I see questionable propagation of char/varchar info that makes it unclear where the length constraints / padding should be enforced. This forced me to add unnecessary aliases to hide the metdata that otherwise causes the write logic to get upset. I'm still unsure if / where the constraints are enforced. I think we should push to eliminate the metadata; perhaps the 1-pass analyzer can help propagate the right information during analysis rather than adhoc tags. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org