chenhao-db opened a new pull request, #56756:
URL: https://github.com/apache/spark/pull/56756
### What changes were proposed in this pull request?
Adds a new JSON reader option `explodeEmbeddedArray` for reading JSON files
that are object documents with a top-level array-valued field, e.g. `{...,
"Records": [record1, record2, ...], ...}`. The option value names the array
field to explode; each element of that array becomes one row, stored as a
single VARIANT column (like `singleVariantColumn`, with which it is mutually
exclusive). The inferred schema names the variant column after the array field,
and a user-specified schema may give the column any name.
The records are split out of the array by a new streaming
`EmbeddedArraySplitter`, and each record is then parsed by the unchanged JSON
parsing logic, so neither the whole array text nor the result rows are ever
buffered in memory. The scan is whole-file: a new `EmbeddedArrayJsonDataSource`
reports `isSplitable = false`, which both the V1 `JsonFileFormat` and the V2
`JsonScan` already consult, so the option works in both v1 and v2 scans
(including compressed files, partitions, and streaming reads). The option is
rejected in `from_json` and `json(Dataset[String])`, where each input is
already a single document.
It is an intentional design decision that anything outside the array is
ignored, because:
1. Users only care about the records array in common use cases (e.g. AWS
CloudTrail).
2. It is difficult to implement if we want to achive both: avoid buffering
the whole array, scanning the file only once.
### Why are the changes needed?
Some common JSON formats (e.g. AWS CloudTrail) wrap their records in a
single top-level array inside one large object document. Today, ingesting such
files requires materializing the whole array, which requires all elements to
stay in memory at the same time and can easily cause OOM. This option makes
that ingestion efficient by streaming the records out one at a time as if they
were independent input records, never buffering the array.
### Does this PR introduce _any_ user-facing change?
Yes — a new opt-in JSON reader option `explodeEmbeddedArray`. There is no
behavior change unless the option is set. Example:
```scala
// file.json: {"Records": [{"a": 1}, {"b": 2}]}
spark.read.format("json").option("explodeEmbeddedArray",
"Records").load(path)
// => 2 rows, one VARIANT column named "Records": {"a":1}, {"b":2}
```
It also adds three new error conditions:
`EXPLODE_EMBEDDED_ARRAY_CONFLICTING_OPTION`,
`EXPLODE_EMBEDDED_ARRAY_UNSUPPORTED_USAGE`, and
`INVALID_EXPLODE_EMBEDDED_ARRAY_SCHEMA`.
### How was this patch tested?
New `EmbeddedArraySplitterSuite` (splitter unit tests, including custom
array field names, nesting/escapes, scalars, BOM, truncated input, and inputs
larger than the internal buffer) and
`ExplodeEmbeddedArrayJsonV1Suite`/`ExplodeEmbeddedArrayJsonV2Suite`
(end-to-end: schema inference/validation, array field names other than
`Records`, user schemas that rename the variant column, malformed records in
all parse modes, truncated/compressed/partitioned/streaming inputs, option
conflicts, the `from_json`/`json(Dataset[String])` rejections, and equivalence
with reading the records as ndjson with `singleVariantColumn`). `JsonSuite`'s
option-validation test was updated for the new option. All pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]