nchammas commented on issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make 
schema_of_json's input json as literal only
URL: https://github.com/apache/spark/pull/22775#issuecomment-594197783
 
 
   This change seems like a step back from the original version introduced in 
#21686.
   
   I have a DataFrame with a JSON column. I suspect the JSON values have an 
inconsistent schema, so I want to first check whether a single schema can apply 
before trying to parse the column.
   
   With the original version of `schema_of_json()`, I could do something like 
this to check whether I have a consistent schema:
   
   ```python
   df.select(schema_of_json(...)).distinct().count()
   ```
   
   But now I can't do that. I can't even wrap `schema_of_json()` in a UDF to 
get something like that, because it returns a `Column` rather than a plain 
value. From an API design point of view, it seems surprising for a function to 
accept only literals yet return a `Column`. And it seems inconsistent with the 
general tenor of Spark SQL functions for a function _not_ to accept Columns as 
input.
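   As a crude stopgap, the per-row check could be approximated with a plain 
Python function wrapped in a UDF. This is only a sketch: `crude_schema` below 
is a hypothetical helper, not part of Spark's API, and it won't match Spark's 
JSON inference exactly (e.g. no type widening across rows):

   ```python
   import json

   def crude_schema(js):
       """Rough, order-insensitive type signature for a JSON string.
       (Hypothetical helper, not part of Spark; meant to be wrapped in a UDF.)"""
       def sig(v):
           # Check bool before int: bool is a subclass of int in Python.
           if isinstance(v, bool):
               return "boolean"
           if isinstance(v, int):
               return "bigint"
           if isinstance(v, float):
               return "double"
           if v is None:
               return "null"
           if isinstance(v, dict):
               fields = ",".join(f"{k}:{sig(v[k])}" for k in sorted(v))
               return f"struct<{fields}>"
           if isinstance(v, list):
               # Collapse element signatures so element order doesn't matter.
               return "array<" + ",".join(sorted({sig(x) for x in v})) + ">"
           return "string"
       return sig(json.loads(js))
   ```

   Registered with `pyspark.sql.functions.udf`, this could feed the same 
`.distinct().count()` check, but it's clearly a workaround rather than a fix.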
   
   Can we revisit the design of this function (as well as that of its cousin, 
`schema_of_csv()`)?
   
   Alternatively, would it make sense to deprecate these functions and instead 
recommend the approach that @HyukjinKwon suggested?
   
   > Actually, that usecase can more easily accomplished by simply inferring 
schema by JSON datasource. Yea, I indeed suggested that as workaround for this 
issue before. Let's say, `spark.read.json(df.select("json").as[String]).schema`.
   
   This demonstrates good Spark style (at least to me), and perhaps we can just 
promote this as a solution and do away with these functions.
   
   For the passing reader, the Python equivalent of Hyukjin's suggestion is:
   
   ```python
   # Assumes the JSON strings are in the first column of df.
   spark.read.json(df.rdd.map(lambda x: x[0])).schema
   ```
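
   And for completeness, here is a sketch of how that suggestion could serve 
the consistency-check use case end to end. This assumes a local `SparkSession` 
and a single string column named `json`; the sample data and the 
null-after-parse check are illustrative, not anything from the PR:

   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import col, from_json

   spark = SparkSession.builder.master("local[1]").getOrCreate()
   df = spark.createDataFrame(
       [('{"a": 1}',), ('{"a": 2, "b": "x"}',)], ["json"]
   )

   # Infer one schema across all rows, per Hyukjin's suggestion.
   inferred = spark.read.json(df.rdd.map(lambda r: r[0])).schema

   # Rows whose JSON can't be parsed under the inferred schema come back null,
   # so counting nulls gives a rough measure of schema inconsistency.
   parsed = df.withColumn("parsed", from_json(col("json"), inferred))
   mismatches = parsed.filter(col("parsed").isNull()).count()
   ```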

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
