Re: [PR] [SPARK-57479][SQL] Read and infer XML schema from tar archives [spark]

via GitHub Wed, 17 Jun 2026 15:42:46 -0700


akshatshenoi-db commented on PR #56572:
URL: https://github.com/apache/spark/pull/56572#issuecomment-4736159999


   ## AI code review (self-review via spark-dev)
   
   Ran an automated code review at head `fc6c690`.
   
   **Verdict: 0 blocking, 0 non-blocking, 0 nits — clean.**
   
   Checked:
   - `readArchive` and `inferWithArchives` both branch on the parse mode and 
mirror `readFile` / the JSON analogue: single-line mode splits each entry/file 
into line records, and multi-line mode tokenizes the whole stream into 
`rowTag`-delimited records.
   - The shared-cursor parse-before-advance invariant holds through the 
`perInput` helper (lazy tokenization, line strings copied before the entry 
cursor advances).
   - Single-line inference reproduces `readFile`'s line decode exactly, so a 
single-line archive read infers and scans the same as a single-line directory 
read.
   - Corrupt/missing inputs are skipped as a unit under 
`ignoreCorruptFiles`/`ignoreMissingFiles`.
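
To make the first two checks concrete, here is a minimal Python sketch of the same idea. It is not the PR's Scala code. The names `build_tar` and `read_archive` are hypothetical, and the regex tokenizer is a naive stand-in for a real XML tokenizer (it ignores nested same-named tags and self-closing rows). It shows the per-mode split and how each entry is copied into owned strings before the shared stream cursor advances.

```python
import io
import re
import tarfile


def build_tar(files):
    """Build an in-memory tar archive from a {name: text} mapping (test helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_archive(tar_bytes, row_tag, multi_line):
    """Yield (entry_name, record) pairs from a tar archive read as a forward-only stream."""
    # Naive row matcher: everything from <rowTag ...> to the next </rowTag>.
    pattern = re.compile(
        rf"<{re.escape(row_tag)}\b.*?</{re.escape(row_tag)}>", re.DOTALL
    )
    # Mode "r|" reads the archive as a non-seekable stream, so all entries
    # share one cursor that only moves forward.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Parse before advance: drain this entry into an owned string now,
            # because moving to the next member invalidates the entry stream.
            text = tar.extractfile(member).read().decode("utf-8")
            if multi_line:
                # Multi-line mode: tokenize the whole entry into rowTag records.
                records = pattern.findall(text)
            else:
                # Single-line mode: each non-blank line is one record.
                records = [line for line in text.splitlines() if line.strip()]
            for record in records:
                yield member.name, record
```

For an entry whose rows each sit on one line, both modes return the same records. For an entry whose row spans several lines, only multi-line mode reassembles it into one record, while single-line mode emits each physical line separately.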
   
   Added a single-line read+infer parity test. I did not build this locally, 
so CI is the gate.
   
   <!-- ai-code-review -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


