Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21892
  
    @jbax I got the following exception on **2.7.3-SNAPSHOT** (commit e51b0958a):
    ```
    Internal state when error was thrown: line=20, column=20481, record=20, charIndex=82594, headers=[col0,..., col999]
        at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:369)
        at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:673)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$7.apply(UnivocityParser.scala:333)
        ...
    Caused by: java.lang.ArrayIndexOutOfBoundsException: 20480
        at com.univocity.parsers.common.ParserOutput.valueParsed(ParserOutput.java:316)
        at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:160)
        at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:654)
        ... 23 more
    ```
    This happened on a CSV file with 1000 columns plus a header row, and with an empty set of selected field indexes. Our settings are:
    ```
    Parser Configuration: CsvParserSettings:
        Auto configuration enabled=true
        Autodetect column delimiter=false
        Autodetect quotes=false
        Column reordering enabled=true
        Delimiters for detection=null
        Empty value=
        Escape unquoted values=false
        Header extraction enabled=null
        Headers=null
        Ignore leading whitespaces=false
        Ignore leading whitespaces in quotes=false
        Ignore trailing whitespaces=false
        Ignore trailing whitespaces in quotes=false
        Input buffer size=128
        Input reading on separate thread=false
        Keep escape sequences=false
        Keep quotes=false
        Length of content displayed on error=-1
        Line separator detection enabled=false
        Maximum number of characters per column=-1
        Maximum number of columns=20480
        Normalize escaped line separators=true
        Null value=
        Number of records to read=all
        Processor=none
        Restricting data in exceptions=false
        RowProcessor error handler=null
        Selected fields=field selection: []
        Skip bits as whitespace=true
        Skip empty lines=true
        Unescaped quote handling=STOP_AT_DELIMITER
    Format configuration:
        CsvFormat:
                Comment character=\0
                Field delimiter=,
                Line separator (normalized)=\n
                Line separator sequence=\n
                Quote character="
                Quote escape character=\
                Quote escape escape character=null
    ```
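    For reference, here is a minimal standalone sketch of the triggering combination using the univocity API directly (this is not the exact Spark code path; the class name, sample input, and header-extraction setting are illustrative, and it needs the `univocity-parsers` dependency on the classpath):
    ```java
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    
    import java.io.StringReader;
    
    public class EmptySelectionRepro {
        public static void main(String[] args) {
            CsvParserSettings settings = new CsvParserSettings();
            settings.setMaxColumns(20480);           // as in the config dump above
            settings.setInputBufferSize(128);
            settings.setColumnReorderingEnabled(true);
            settings.selectIndexes();                // empty field selection, as in our case
    
            CsvParser parser = new CsvParser(settings);
            // On the affected snapshot, parsing wide input with an empty selection
            // should end in ArrayIndexOutOfBoundsException from ParserOutput.valueParsed
            parser.parseAll(new StringReader("col0,col1,col2\n1,2,3\n"));
        }
    }
    ```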
    
    Here is the input file (3.5GB uncompressed). The attachment is actually an `.xz` archive, so you need to change the extension to `test.csv.xz` after downloading: [test.csv.zip](https://github.com/apache/spark/files/2246796/test.csv.zip)
    
    


