Is it not enough to set `maxColumns` in CSV options? https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
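For example, something like this (an untested sketch; the input path and the 100000 cap are placeholders, and a header row is assumed; size maxColumns to the widest row you expect):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

// "maxColumns" is handed down to univocity's
// CsvParserSettings.setMaxColumns(); the default in 2.1 is 20480.
val df = spark.read
  .option("header", "true")       // assumption: files have a header row
  .option("maxColumns", "100000") // placeholder: raise above your widest row
  .csv("/path/to/input/*.csv")    // placeholder path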
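As for skipping the remaining invalid rows instead of failing the job, the CSV reader also takes a `mode` option (PERMISSIVE / DROPMALFORMED / FAILFAST). A sketch with the same placeholder paths; one caveat is that, as far as I can tell, `mode` only kicks in after univocity has tokenized the line, so a row that still exceeds maxColumns can abort the parse anyway, which is why raising the cap comes first:

val cleaned = spark.read
  .option("header", "true")
  .option("maxColumns", "100000")
  .option("mode", "DROPMALFORMED") // drop rows that don't match the expected schema
  .csv("/path/to/input/*.csv")

// Writing out as Avro on 2.1 needs the external spark-avro package:
cleaned.write.format("com.databricks.spark.avro").save("/path/to/output")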
// maropu

On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Spark CSV data source should be able
>
> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>
> Hi everyone,
> I am using Spark 2.1.1 to read CSV files and convert them to Avro files.
> One problem I am facing is that if one row of a CSV file has more columns
> than maxColumns (default is 20480), the parsing process stops:
>
> Internal state when error was thrown: line=1, column=3, record=0,
> charIndex=12
> com.univocity.parsers.common.TextParsingException:
> java.lang.ArrayIndexOutOfBoundsException - 2
> Hint: Number of columns processed may have exceeded limit of 2 columns.
> Use settings.setMaxColumns(int) to define the maximum number of columns
> your input can have
> Ensure your configuration is correct, with delimiters, quotes and escape
> sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
>
> I did some investigation in the univocity
> <https://github.com/uniVocity/univocity-parsers> library, but the way it
> handles this case is to throw an error, which is why Spark stops the
> process.
>
> How can I skip the invalid row and just continue parsing the next valid
> one? Are there any libraries that could replace univocity for this job?
>
> Thanks & regards,
> Chanh
> --
> Regards,
> Chanh

--
---
Takeshi Yamamuro