andygrove opened a new issue, #3203:
URL: https://github.com/apache/datafusion-comet/issues/3203
## Description
Add native Comet support for Spark's `from_json` function (`JsonToStructs`
expression), which parses JSON strings into structured data types (StructType,
ArrayType, or MapType).
## Spark Specification
### Syntax
```sql
from_json(jsonStr, schema [, options])
```
### Arguments
| Argument | Type | Description |
|----------|------|-------------|
| jsonStr | StringType | Column containing JSON strings to parse |
| schema | DataType (StructType, ArrayType, or MapType) or DDL string | Target schema defining the structure of the parsed output |
| options | Map[String, String] (optional) | JSON parsing options |
### Return Type
Returns a column matching the provided schema type (struct, array, or map).
### Key Options
- `mode`: PERMISSIVE (default) or FAILFAST
- `dateFormat`: Format for parsing dates (default: yyyy-MM-dd)
- `timestampFormat`: Format for parsing timestamps
- `columnNameOfCorruptRecord`: Field used to store malformed records (PERMISSIVE mode)
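The options above arrive as a string-to-string map with Spark-defined defaults. A minimal Rust sketch of resolving them (the `JsonOptions` struct and `resolve_options` function are illustrative, not Comet's actual types):

```rust
use std::collections::HashMap;

/// Illustrative container for the option keys listed above; the key names
/// and defaults follow Spark's documented behavior, but this struct is a
/// sketch, not Comet's actual configuration type.
#[derive(Debug, PartialEq)]
struct JsonOptions {
    mode: String,
    date_format: String,
    timestamp_format: Option<String>,
}

fn resolve_options(opts: &HashMap<String, String>) -> JsonOptions {
    JsonOptions {
        // PERMISSIVE is Spark's default parse mode.
        mode: opts.get("mode").cloned().unwrap_or_else(|| "PERMISSIVE".to_string()),
        // yyyy-MM-dd is Spark's default dateFormat.
        date_format: opts
            .get("dateFormat")
            .cloned()
            .unwrap_or_else(|| "yyyy-MM-dd".to_string()),
        // No single default worth hard-coding here; left unset if absent.
        timestamp_format: opts.get("timestampFormat").cloned(),
    }
}

fn main() {
    let defaults = resolve_options(&HashMap::new());
    assert_eq!(defaults.mode, "PERMISSIVE");
    assert_eq!(defaults.date_format, "yyyy-MM-dd");
    assert_eq!(defaults.timestamp_format, None);

    let mut user = HashMap::new();
    user.insert("mode".to_string(), "FAILFAST".to_string());
    assert_eq!(resolve_options(&user).mode, "FAILFAST");
    println!("ok");
}
```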
### Edge Cases
- Null input returns null output
- Missing fields in JSON are set to null
- Empty string input returns null
- PERMISSIVE mode: malformed JSON returns row with parseable fields populated
- FAILFAST mode: throws exception on malformed JSON
- Field names are case-sensitive
- Extra fields in JSON not in schema are ignored
### Examples
```sql
-- Basic struct parsing
SELECT from_json('{"name":"Alice","age":30}', 'name STRING, age INT');
-- Result: {Alice, 30}
-- Parsing to MapType
SELECT from_json('{"key1":"value1"}', 'MAP<STRING,STRING>');
-- With options
SELECT from_json('{"a":1}', 'a INT', map('mode', 'FAILFAST'));
```
## Implementation Approach
1. **Scala Serde**: Create `CometJsonToStructs` in
`spark/src/main/scala/org/apache/comet/serde/`
- Handle schema serialization to protobuf
- Serialize options map
- Consider marking complex options as `Unsupported` initially
2. **Protobuf**: May need new message type in `expr.proto` for schema and
options
3. **Rust Implementation**: Implement JSON parsing in
`native/spark-expr/src/`
- Use `serde_json` for parsing
- Match Spark's null handling behavior
- Support PERMISSIVE and FAILFAST modes
4. **Potential Simplifications for Initial Implementation**:
- Start with StructType schemas only (defer ArrayType, MapType)
- Support common options only (mode, dateFormat, timestampFormat)
- UTC timezone only initially
## Difficulty
Large - requires protobuf changes, Rust JSON parsing implementation, and
careful compatibility testing for edge cases.
## References
- Spark documentation:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.from_json.html
- Contributor guide:
https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html
---
> **Note:** This issue was generated with AI assistance.