andygrove opened a new issue, #3203:
URL: https://github.com/apache/datafusion-comet/issues/3203
## Description
Add native Comet support for Spark's `from_json` function (`JsonToStructs`
expression), which parses JSON strings into structured data types (StructType,
ArrayType, or MapType).
## Spark Specification
### Syntax
```sql
from_json(jsonStr, schema [, options])
```
### Arguments
| Argument | Type | Description |
|----------|------|-------------|
| jsonStr | StringType | Column containing JSON strings to parse |
| schema | DataType (StructType, ArrayType, or MapType) or DDL string | Target schema defining the structure of the parsed output |
| options | Map[String, String] (optional) | JSON parsing options |
### Return Type
Returns a column matching the provided schema type (struct, array, or map).
### Key Options
- `mode`: PERMISSIVE (default) or FAILFAST
- `dateFormat`: Format for parsing dates (default: yyyy-MM-dd)
- `timestampFormat`: Format for parsing timestamps
- `columnNameOfCorruptRecord`: Field used to store malformed records (PERMISSIVE mode)
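The options above arrive as a string-to-string map with Spark-defined defaults. A minimal Rust sketch of resolving them (the `JsonOptions` struct and `resolve_options` function are illustrative, not Comet's actual types):

```rust
use std::collections::HashMap;

/// Illustrative container for the option keys listed above; the key names
/// and defaults follow Spark's documented behavior, but this struct is a
/// sketch, not Comet's actual configuration type.
#[derive(Debug, PartialEq)]
struct JsonOptions {
    mode: String,
    date_format: String,
    timestamp_format: Option<String>,
}

fn resolve_options(opts: &HashMap<String, String>) -> JsonOptions {
    JsonOptions {
        // PERMISSIVE is Spark's default parse mode.
        mode: opts.get("mode").cloned().unwrap_or_else(|| "PERMISSIVE".to_string()),
        // yyyy-MM-dd is Spark's default dateFormat.
        date_format: opts
            .get("dateFormat")
            .cloned()
            .unwrap_or_else(|| "yyyy-MM-dd".to_string()),
        // No single default worth hard-coding here; left unset if absent.
        timestamp_format: opts.get("timestampFormat").cloned(),
    }
}

fn main() {
    let defaults = resolve_options(&HashMap::new());
    assert_eq!(defaults.mode, "PERMISSIVE");
    assert_eq!(defaults.date_format, "yyyy-MM-dd");
    assert_eq!(defaults.timestamp_format, None);

    let mut user = HashMap::new();
    user.insert("mode".to_string(), "FAILFAST".to_string());
    assert_eq!(resolve_options(&user).mode, "FAILFAST");
    println!("ok");
}
```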
### Edge Cases
- Null input returns null output
- Missing fields in JSON are set to null
- Empty string input returns null
- PERMISSIVE mode: malformed JSON returns row with parseable fields populated
- FAILFAST mode: throws exception on malformed JSON
- Field names are case-sensitive
- Extra fields in JSON not in schema are ignored
### Examples
```sql
-- Basic struct parsing
SELECT from_json('{"name":"Alice","age":30}', 'name STRING, age INT');
-- Result: {Alice, 30}
-- Parsing to MapType
SELECT from_json('{"key1":"value1"}', 'MAP<STRING,STRING>');
-- With options
SELECT from_json('{"a":1}', 'a INT', map('mode', 'FAILFAST'));
```
## Implementation Approach
1. **Scala Serde**: Create `CometJsonToStructs` in
`spark/src/main/scala/org/apache/comet/serde/`
- Handle schema serialization to protobuf
- Serialize options map
- Consider marking complex options as `Unsupported` initially
2. **Protobuf**: May need new message type in `expr.proto` for schema and
options
3. **Rust Implementation**: Implement JSON parsing in
`native/spark-expr/src/`
- Use `serde_json` for parsing
- Match Spark's null handling behavior
- Support PERMISSIVE and FAILFAST modes
4. **Potential Simplifications for Initial Implementation**:
- Start with StructType schemas only (defer ArrayType, MapType)
- Support common options only (mode, dateFormat, timestampFormat)
- UTC timezone only initially
## Difficulty
Large - requires protobuf changes, Rust JSON parsing implementation, and
careful compatibility testing for edge cases.
## References
- Spark documentation:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.from_json.html
- Contributor guide:
https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html
---
> **Note:** This issue was generated with AI assistance.