[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524802#comment-16524802
 ] 

Maxim Gekk commented on SPARK-24642:
------------------------------------

> Do we want this as an aggregate function?

I thought of something similar to the inferSchema flag, where the CSV datasource 
triggers a separate job to infer a schema; the same could be done for JSON files.

> I'm thinking it's better to just take a string and infer the schema on the 
> string.

In general that looks much cheaper than scanning the full input with an aggregate 
function, but we have an opportunity to minimize the number of rows touched by the 
aggregate function via sampling, or by using just the first few rows of each partition.
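As a rough illustration (plain Scala with toy names; Spark's real sampling would operate on RDD partitions, and `headOfEachPartition` is a hypothetical helper, not a Spark API), taking only the first few rows of each partition bounds the inference cost:

```scala
// Sketch: restrict schema inference to the first N rows per partition.
// `partitions` stands in for the partitions of an RDD of JSON strings.
object SampleSketch {
  def headOfEachPartition[T](partitions: Seq[Seq[T]], n: Int): Seq[T] =
    partitions.flatMap(_.take(n))

  val partitions: Seq[Seq[String]] = Seq(
    Seq("""{"a": 1}""", """{"a": 2}""", """{"a": 3}"""),
    Seq("""{"b": [1]}""", """{"b": [2]}"""))

  // Only up to 2 rows per partition are touched instead of the full input.
  val sampled: Seq[String] = headOfEachPartition(partitions, n = 2)
}
```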

And what happens if some JSON strings are incomplete, like:
{code}
{"a": 1}
{"b": [1,2,3]}
{"a": 3, "b": [10, 11, 12]}
{code} 
In that case, each parsed JSON string will have a different inferred schema, 
right? Which schema should we assign to the parsed JSON column?
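One option is to merge the per-record schemas into a single one. A toy sketch (plain Scala; `Schema`, `merge`, and the type strings are illustrative stand-ins, not Spark's real inference, which handles nesting and type widening):

```scala
// Sketch: merge per-record schemas, falling back to string on a type conflict.
object MergeSketch {
  type Schema = Map[String, String] // field name -> DDL type string

  def merge(a: Schema, b: Schema): Schema =
    (a.keySet ++ b.keySet).map { k =>
      (a.get(k), b.get(k)) match {
        case (Some(t1), Some(t2)) if t1 == t2 => k -> t1
        case (Some(t1), None)                 => k -> t1
        case (None, Some(t2))                 => k -> t2
        case _                                => k -> "string" // conflicting types
      }
    }.toMap

  // Per-record schemas for the three example JSON strings above.
  val perRecord: Seq[Schema] = Seq(
    Map("a" -> "bigint"),
    Map("b" -> "array<bigint>"),
    Map("a" -> "bigint", "b" -> "array<bigint>"))

  val merged: Schema = perRecord.reduce(merge)
}
```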

> How would the query you provide compile if it is an aggregate function?

I am going to assign the from_json name to the FromJson case class, and write 
the following rule to trigger a job that replaces the aggregate with a string 
literal, as in this code snippet (thank you [~hvanhovell] for the code):
{code}
case class FromJson(child: Expression) extends Expression {
 ...
}

class SchemaInferringRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    plan transform {
      case node =>
        node.transformExpressions {
          case FromJson(e) =>
            // Kick off inference
            val query = new QueryExecution(
              session,
              Project(Seq(Alias(InferSchema(e), "schema")()), node))
            val Array(row) = query.executedPlan.executeCollect()
            val schema = Literal(row.getUTF8String(0), StringType)
            new JsonToStructs(e, schema)
        }
    }
  }
}
{code}

> Add a function which infers schema from a JSON column
> -----------------------------------------------------
>
>                 Key: SPARK-24642
>                 URL: https://issues.apache.org/jira/browse/SPARK-24642
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Need to add a new aggregate function - *infer_schema()*. The function should 
> infer a schema for a set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing the output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer the schema programmatically in Scala/Python and 
> pass it as the second argument, but that is not possible in SQL. A user has to 
> pass the schema as a string literal in SQL. The new function should allow its use 
> in SQL as in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
