Akshat Shenoi created SPARK-57479:
-------------------------------------
Summary: [SQL] Read and infer XML schema from tar archives
Key: SPARK-57479
URL: https://issues.apache.org/jira/browse/SPARK-57479
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.3.0
Reporter: Akshat Shenoi
Assignee: Akshat Shenoi
Fix For: 4.3.0
SPARK-57135 / SPARK-57321 added reading and schema inference for CSV files
packed in tar archives (.tar/.tar.gz/.tgz), and SPARK-57419 did the same for
JSON, gated by spark.sql.files.archive.reader.enabled. This extends the same
capability to the text data source.
When the flag is enabled, the V1 text source treats a tar archive as a
directory of its entries. Each entry is streamed through the ArchiveReader
(never unpacked to disk) and read exactly like a standalone text file: one
row per line, or a single row holding the whole entry when wholeText is set.
The archive as a whole is a single non-splittable unit (isSplitable returns
false for an archive path).
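
The row semantics described above can be illustrated outside Spark with a
minimal sketch that uses only the Python standard library. It is not the
ArchiveReader implementation: the function names `build_demo_archive` and
`read_text_rows` and the `whole_text` parameter are hypothetical, chosen only
to mirror the wholeText option, and the archive is built in memory as demo
data.

```python
import io
import tarfile


def build_demo_archive() -> bytes:
    """Build a small in-memory .tar.gz with two text entries (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in [("a.txt", "alpha\nbeta\n"), ("b.txt", "gamma")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_text_rows(archive: bytes, whole_text: bool = False):
    """Yield row values from every regular-file entry, in archive order.

    Hypothetical helper: entries are read through file objects and are
    never extracted to disk.
    """
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            entry = tar.extractfile(member)
            if whole_text:
                # One row holding the whole entry.
                yield entry.read().decode("utf-8")
            else:
                # One row per line, with the line terminator removed.
                for raw in entry:
                    yield raw.decode("utf-8").rstrip("\r\n")


if __name__ == "__main__":
    demo = build_demo_archive()
    print(list(read_text_rows(demo)))
    print(list(read_text_rows(demo, whole_text=True)))
```

The whole archive is consumed sequentially by a single reader in this sketch,
which loosely corresponds to treating the archive as one non-splittable unit.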
Text has a fixed `value STRING` schema, so there is no schema inference.
Archive scanning is wired into the V1 file source only; the DSv2 reader is left
untouched.