[GitHub] [spark] Hisoka-X opened a new pull request, #42979: [SPARK-45035][SQL] Fix ignoreCorruptFiles with multiline CSV/JSON will report error

via GitHub Mon, 18 Sep 2023 05:27:15 -0700


Hisoka-X opened a new pull request, #42979:
URL: https://github.com/apache/spark/pull/42979


### What changes were proposed in this pull request?
Fix the error raised when `ignoreCorruptFiles` is enabled and CSV/JSON files are read in
multiline mode. The failure looks like this:
```log
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0
(TID 4031) (10.68.177.106 executor 0):
com.univocity.parsers.common.TextParsingException:
java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=1048576
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=1000
Line separator detection enabled=true
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...

```
This happens because multiline CSV/JSON reads use `BinaryFileRDD`, not
`FileScanRDD`. When `FileScanRDD` meets a corrupt file, it checks the
`ignoreCorruptFiles` config to avoid reporting an IOException. `BinaryFileRDD`
does not report an error itself, because it just returns a normal
`PortableDataStream`; the failure only shows up later, when that stream is read.
So we should catch the exception inside the schema inference lambda function.
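
To illustrate the idea, here is a minimal, Spark-free sketch of the pattern, not the code in this PR. `LazyStream`, `inferHeader` and `inferAll` are hypothetical names made up for the example: `LazyStream` stands in for a handle like `PortableDataStream` that is always created successfully and can only fail when it is opened and read. The sketch catches both `IOException` and `RuntimeException`, on the assumption that either can surface for a corrupt file (the log above shows an `IllegalStateException`).

```scala
import java.io.{IOException, InputStream}
import scala.io.Source

object CorruptFileSketch {
  // Hypothetical stand-in for a lazily opened file handle:
  // building it never fails, only opening or reading it can.
  final case class LazyStream(path: String, open: () => InputStream)

  // Hypothetical per-file inference step: returns the comma-separated
  // names found on the first line of the file.
  def inferHeader(stream: LazyStream): Seq[String] = {
    val in = stream.open()
    try {
      Source.fromInputStream(in, "UTF-8")
        .getLines()
        .take(1)
        .toList
        .flatMap(_.split(",").toList)
    } finally in.close()
  }

  // Runs inference over all files. With ignoreCorruptFiles = true, a file
  // that fails while being read is skipped; otherwise the exception
  // propagates to the caller, as in the log above.
  def inferAll(files: Seq[LazyStream], ignoreCorruptFiles: Boolean): Seq[Seq[String]] =
    files.flatMap { file =>
      try Some(inferHeader(file))
      catch {
        case _: IOException if ignoreCorruptFiles => None
        case _: RuntimeException if ignoreCorruptFiles => None
      }
    }
}
```

The point of the sketch is where the guard sits: because the handle itself is always valid, the check has to wrap the code that reads the stream, which in this PR is the schema inference lambda.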

### Why are the changes needed?
To fix the bug that occurs when multiline mode is used together with the `ignoreCorruptFiles` config.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test.

### Was this patch authored or co-authored using generative AI tooling?
No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
