cloud-fan commented on code in PR #56260:
URL: https://github.com/apache/spark/pull/56260#discussion_r3392221339
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -634,7 +623,20 @@ private[sql] object UnivocityParser {
// We can handle header here since here the stream is open.
handleHeader()
- private var nextRecord = tokenizer.parseNext()
+ // Streaming parse path used by CSV schema inference and multi-line reads.
The raw record text
+ // is not available here, so the MALFORMED_CSV_RECORD error reports an
empty bad record.
+ private def parseNextRecord(): Array[String] = {
Review Comment:
Should `handleHeader()` (just above) be covered by the same translation? On
the multiLine read path it ends in a raw `tokenizer.parseNext()`
(`CSVHeaderChecker.scala:133`), so a header row exceeding `maxColumns` still
surfaces the raw `ArrayIndexOutOfBoundsException` this PR is eliminating —
likewise the non-multiLine header parse at `CSVHeaderChecker.scala:149`. The
gap is pre-existing, so fine to punt, but worth either covering it here or
noting why headers are out of scope.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -643,11 +645,39 @@ private[sql] object UnivocityParser {
throw QueryExecutionErrors.endOfStreamError()
}
val curRecord = convert(nextRecord)
- nextRecord = tokenizer.parseNext()
+ nextRecord = parseNextRecord()
curRecord
}
}
+ /**
+ * Builds the user-facing MALFORMED_CSV_RECORD error raised when Univocity
hits an
+ * ArrayIndexOutOfBoundsException for a malformed record (e.g. more columns
than `maxColumns`).
+ * Univocity raises it either bare or wrapped in a TextParsingException
depending on the call.
+ * SPARK-49444 fixed the per-line path; this also covers the streaming path.
+ */
+ private def malformedCsvRecord(cause: Throwable, badRecord: String):
SparkRuntimeException =
+ new SparkRuntimeException(
+ errorClass = "MALFORMED_CSV_RECORD",
+ messageParameters = Map("badRecord" -> badRecord),
Review Comment:
This embeds `badRecord` into the error message unbounded, and the
line-routed paths pass the entire raw line. The catch can't distinguish a
`maxColumns` overflow from a `maxCharsPerColumn` one — the SPARK-28431 test
failures on 545ba24 demonstrated that — so an oversized value on a data row
during non-multiLine inference now yields a `MALFORMED_CSV_RECORD` message
carrying the full line, potentially tens of MB. That re-introduces on this path
exactly what SPARK-28431 bounded to 1000 chars, and what your last commit
preserved for the first-line parse only. I'd truncate here in the helper (e.g.
`badRecord.take(1000)` plus a truncation marker — univocity uses the same bound
via `setErrorContentLength`). That also fixes the same latent issue on the
per-line read path from SPARK-49444, which is reachable with an explicit schema.
Related, same surface: the streaming path passes `""` because the raw record
text is unavailable — but `TextParsingException.getParsedContent()` is bounded
to 1000 chars by our parser settings, so it may give a usable bad-record
approximation. Did you check whether it's populated at the overflow point?
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3506,6 +3506,53 @@ abstract class CSVSuite
assert(textParsingException.getCause.isInstanceOf[ArrayIndexOutOfBoundsException])
}
+ test("SPARK-57195: multiLine CSV schema inference surfaces
MALFORMED_CSV_RECORD for a row " +
Review Comment:
Both new tests cover inference, but the multiLine read path (`parseStream`)
this PR also guards is untested. A multiLine read with an explicit schema and
an overflow row — the multiLine twin of the SPARK-49444 test above — would pin
that behavior, including the `FAILED_READ_FILE` wrapping.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]