Re: [PR] [SPARK-57195][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException in CSV schema inference [spark]

via GitHub Wed, 10 Jun 2026 16:26:32 -0700


cloud-fan commented on code in PR #56260:
URL: https://github.com/apache/spark/pull/56260#discussion_r3392221339



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -634,7 +623,20 @@ private[sql] object UnivocityParser {
     // We can handle header here since here the stream is open.
     handleHeader()
 
-    private var nextRecord = tokenizer.parseNext()
+    // Streaming parse path used by CSV schema inference and multi-line reads. 
The raw record text
+    // is not available here, so the MALFORMED_CSV_RECORD error reports an 
empty bad record.
+    private def parseNextRecord(): Array[String] = {

Review Comment:
   Should `handleHeader()` (just above) be covered by the same translation? On 
the multiLine read path it ends in a raw `tokenizer.parseNext()` 
(`CSVHeaderChecker.scala:133`), so a header row exceeding `maxColumns` still 
surfaces the raw `ArrayIndexOutOfBoundsException` this PR is eliminating — 
likewise the non-multiLine header parse at `CSVHeaderChecker.scala:149`. The 
gap is pre-existing, so fine to punt, but worth either covering it here or 
noting why headers are out of scope.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -643,11 +645,39 @@ private[sql] object UnivocityParser {
         throw QueryExecutionErrors.endOfStreamError()
       }
       val curRecord = convert(nextRecord)
-      nextRecord = tokenizer.parseNext()
+      nextRecord = parseNextRecord()
       curRecord
     }
   }
 
+  /**
+   * Builds the user-facing MALFORMED_CSV_RECORD error raised when Univocity 
hits an
+   * ArrayIndexOutOfBoundsException for a malformed record (e.g. more columns 
than `maxColumns`).
+   * Univocity raises it either bare or wrapped in a TextParsingException 
depending on the call.
+   * SPARK-49444 fixed the per-line path; this also covers the streaming path.
+   */
+  private def malformedCsvRecord(cause: Throwable, badRecord: String): 
SparkRuntimeException =
+    new SparkRuntimeException(
+      errorClass = "MALFORMED_CSV_RECORD",
+      messageParameters = Map("badRecord" -> badRecord),

Review Comment:
   This embeds `badRecord` into the error message unbounded, and the 
line-routed paths pass the entire raw line. The catch can't distinguish a 
`maxColumns` overflow from a `maxCharsPerColumn` one — the SPARK-28431 test 
failures on 545ba24 demonstrated that — so an oversized value on a data row 
during non-multiLine inference now yields a `MALFORMED_CSV_RECORD` message 
carrying the full line, potentially tens of MB. That re-introduces on this path 
exactly what SPARK-28431 bounded to 1000 chars, and what your last commit 
preserved for the first-line parse only. I'd truncate here in the helper (e.g. 
`badRecord.take(1000)` plus a truncation marker — univocity uses the same bound 
via `setErrorContentLength`). That also fixes the same latent issue on the 
per-line read path from SPARK-49444, which is reachable with an explicit schema.
   
   Related, same surface: the streaming path passes `""` because the raw record 
text is unavailable — but `TextParsingException.getParsedContent()` is bounded 
to 1000 chars by our parser settings, so it may give a usable bad-record 
approximation. Did you check whether it's populated at the overflow point?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3506,6 +3506,53 @@ abstract class CSVSuite
     
assert(textParsingException.getCause.isInstanceOf[ArrayIndexOutOfBoundsException])
   }
 
+  test("SPARK-57195: multiLine CSV schema inference surfaces 
MALFORMED_CSV_RECORD for a row " +

Review Comment:
   Both new tests cover inference, but the multiLine read path (`parseStream`) 
this PR also guards is untested. A multiLine read with an explicit schema and 
an overflow row — the multiLine twin of the SPARK-49444 test above — would pin 
that behavior, including the `FAILED_READ_FILE` wrapping.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57195][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException in CSV schema inference [spark]

Reply via email to