chenhao-db opened a new pull request, #56756:
URL: https://github.com/apache/spark/pull/56756

   ### What changes were proposed in this pull request?
   
   Adds a new JSON reader option `explodeEmbeddedArray` for reading JSON files 
that are object documents with a top-level array-valued field, e.g. `{..., 
"Records": [record1, record2, ...], ...}`. The option value names the array 
field to explode; each element of that array becomes one row, stored as a 
single VARIANT column (like `singleVariantColumn`, with which it is mutually 
exclusive). The inferred schema names the variant column after the array field, 
and a user-specified schema may give the column any name.
   
   The records are split out of the array by a new streaming 
`EmbeddedArraySplitter`, and each record is then parsed by the unchanged JSON 
parsing logic, so neither the whole array text nor the result rows are ever 
buffered in memory. The scan is whole-file: a new `EmbeddedArrayJsonDataSource` 
reports `isSplitable = false`, which both the V1 `JsonFileFormat` and the V2 
`JsonScan` already consult, so the option works in both v1 and v2 scans 
(including compressed files, partitions, and streaming reads). The option is 
rejected in `from_json` and `json(Dataset[String])`, where each input is 
already a single document.
   
   It is an intentional design decision that anything outside the array is 
ignored, because:
   1. Users only care about the records array in common use cases (e.g. AWS 
CloudTrail).
   2. It is difficult to implement if we want to achive both: avoid buffering 
the whole array, scanning the file only once.
   
   ### Why are the changes needed?
   
   Some common JSON formats (e.g. AWS CloudTrail) wrap their records in a 
single top-level array inside one large object document. Today, ingesting such 
files requires materializing the whole array, which requires all elements to 
stay in memory at the same time and can easily cause OOM. This option makes 
that ingestion efficient by streaming the records out one at a time as if they 
were independent input records, never buffering the array.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes — a new opt-in JSON reader option `explodeEmbeddedArray`. There is no 
behavior change unless the option is set. Example:
   
   ```scala
   // file.json: {"Records": [{"a": 1}, {"b": 2}]}
   spark.read.format("json").option("explodeEmbeddedArray", 
"Records").load(path)
   // => 2 rows, one VARIANT column named "Records": {"a":1}, {"b":2}
   ```
   
   It also adds three new error conditions: 
`EXPLODE_EMBEDDED_ARRAY_CONFLICTING_OPTION`, 
`EXPLODE_EMBEDDED_ARRAY_UNSUPPORTED_USAGE`, and 
`INVALID_EXPLODE_EMBEDDED_ARRAY_SCHEMA`.
   
   ### How was this patch tested?
   
   New `EmbeddedArraySplitterSuite` (splitter unit tests, including custom 
array field names, nesting/escapes, scalars, BOM, truncated input, and inputs 
larger than the internal buffer) and 
`ExplodeEmbeddedArrayJsonV1Suite`/`ExplodeEmbeddedArrayJsonV2Suite` 
(end-to-end: schema inference/validation, array field names other than 
`Records`, user schemas that rename the variant column, malformed records in 
all parse modes, truncated/compressed/partitioned/streaming inputs, option 
conflicts, the `from_json`/`json(Dataset[String])` rejections, and equivalence 
with reading the records as ndjson with `singleVariantColumn`). `JsonSuite`'s 
option-validation test was updated for the new option. All pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to