Akshat Shenoi created SPARK-57478:
-------------------------------------
Summary: [SQL] Read JSON files from tar archives
Key: SPARK-57478
URL: https://issues.apache.org/jira/browse/SPARK-57478
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.3.0
Reporter: Akshat Shenoi
Assignee: Akshat Shenoi
Fix For: 4.3.0

SPARK-57135 added support for reading CSV files packed in tar archives
(.tar/.tar.gz/.tgz), and SPARK-57321 added schema inference for them, both
gated by spark.sql.files.archive.reader.enabled. This issue extends the same
capability to the JSON data source.

When spark.sql.files.archive.reader.enabled is true, the V1 JSON data source
reads a tar archive as if it were a directory of its entries: each entry is
streamed through ArchiveReader (never unpacked to disk) and parsed exactly
like a standalone JSON file, for both line-delimited and multi-line JSON.

Schema inference reads every archive entry together with any loose files
alongside it in a single JsonInferSchema pass, so the inferred schema matches
a directory read of the same files. The whole archive is a single
non-splittable unit, and a corrupt or missing archive is skipped as a unit
under ignoreCorruptFiles/ignoreMissingFiles.

The DSv2 JSON reader does not support archives, so it refuses to infer a
schema for archive inputs (raising UNABLE_TO_INFER_SCHEMA) rather than
misreading raw archive bytes.

Unlike CSV, JSON needs no per-entry header handling (records are
self-describing, so one parser serves every entry) and no mergeSchema-style
branching (JsonInferSchema already merges record types by field name across
all inputs, so a single pass already produces the union).

This change also unifies the archive test suites: the format-agnostic
inference and complex-type tests are hoisted into ArchiveReadSuiteBase behind
capability hooks (supportsSchemaInference, supportsComplexTypes), so CSV,
JSON, and future archive formats share them instead of each duplicating them.
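
The read path described above can be illustrated outside Spark. The following
is a minimal sketch in plain Python, not Spark's ArchiveReader or
JsonInferSchema: it streams line-delimited JSON out of a tar archive without
unpacking entries to disk, then takes a union of field names over all records
in one pass. The helper names (build_archive, iter_records, infer_schema) are
hypothetical and exist only for this illustration, and the "schema" is
simplified to a set of Python type names per field.

```python
import io
import json
import tarfile


def build_archive(entries):
    """Build an in-memory .tar.gz from {entry name: text}. Fixture only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in entries.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def iter_records(fileobj):
    """Yield one dict per JSON line, across every regular entry."""
    # "r|*" opens the archive as a forward-only stream with transparent
    # compression detection; entries are read in order and never written
    # to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other non-files
            entry = tar.extractfile(member)
            for line in entry:
                line = line.strip()
                if line:
                    yield json.loads(line)


def infer_schema(records):
    """Union of observed value types per field name, in a single pass."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


archive = build_archive({
    "a.json": '{"id": 1, "name": "x"}\n{"id": 2}\n',
    "b.json": '{"id": 3, "score": 1.5}\n',
})
records = list(iter_records(archive))
schema = infer_schema(records)
```

Because every record carries its own field names, the same parser handles each
entry and the field-name union needs no per-entry bookkeeping, which mirrors
the reasoning given above for why JSON avoids the CSV-specific header and
mergeSchema handling.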