nchammas commented on issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's input json as literal only
URL: https://github.com/apache/spark/pull/22775#issuecomment-594197783

This change seems like a step back from the original version introduced in #21686.

I have a DataFrame with a JSON column. I suspect the JSON values have an inconsistent schema, so I want to first check whether a single schema applies before trying to parse the column. With the original version of `schema_of_json()`, I could do something like this to check whether or not I have a consistent schema:

```python
df.select(schema_of_json(...)).distinct().count()
```

But now I can't do that. I can't even wrap `schema_of_json()` in a UDF to get something similar, because it returns a `Column`. From an API design point of view, it is surprising for a function to accept only literals yet return a `Column`, and it is inconsistent with the general tenor of Spark SQL functions for a function _not_ to accept Columns as input.

Can we revisit the design of this function (as well as that of its cousin, `schema_of_csv()`)? Alternatively, would it make sense to deprecate these functions and instead recommend the approach that @HyukjinKwon suggested?

> Actually, that usecase can more easily accomplished by simply inferring schema by JSON datasource. Yea, I indeed suggested that as workaround for this issue before. Let's say, `spark.read.json(df.select("json").as[String]).schema`.

This demonstrates good Spark style (at least to me), and perhaps we can just promote this as a solution and do away with these functions.

For the passing reader, the Python equivalent of Hyukjin's suggestion is:

```python
spark.read.json(df.rdd.map(lambda x: x[0])).schema
```
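To make the literal-only limitation concrete, here is a minimal sketch. The sample data and column name are hypothetical, and the exact error type may vary by Spark version:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, schema_of_json

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: a string column holding JSON documents
# with inconsistent shapes.
df = spark.createDataFrame(
    [('{"a": 1}',), ('{"a": 1, "b": 2}',)],
    ["json"],
)

# Accepted: the input is a string literal.
df.select(schema_of_json(lit('{"a": 1}'))).show(truncate=False)

# Rejected after this change: the input is a non-literal Column, so the
# analyzer raises an error instead of inferring a schema per row.
# df.select(schema_of_json(col("json"))).distinct().count()
```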
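And for anyone who still wants the per-row consistency check, one rough stand-in is to compare top-level key sets directly, sidestepping `schema_of_json()` altogether. This is only a sketch: it assumes the column is named `json` as above, and it ignores nested structure and value types entirely:

```python
import json

# Count distinct top-level key sets across all documents; a result of 1
# suggests (but does not prove) that a single schema fits every row.
distinct_key_sets = (
    df.rdd
    .map(lambda row: tuple(sorted(json.loads(row["json"]).keys())))
    .distinct()
    .count()
)
print(distinct_key_sets)
```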
