Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-09 Thread Chanh Le
Hi Takeshi,

Thank you very much.

Regards,
Chanh


On Thu, Jun 8, 2017 at 11:05 PM Takeshi Yamamuro wrote:

> I filed a jira about this issue:
> https://issues.apache.org/jira/browse/SPARK-21024
>
-- 
Regards,
Chanh


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-08 Thread Takeshi Yamamuro
I filed a jira about this issue:
https://issues.apache.org/jira/browse/SPARK-21024

On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le  wrote:

> Can you recommend one?
>
> Thanks.
>



-- 
---
Takeshi Yamamuro


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-08 Thread Chanh Le
Can you recommend one?

Thanks.

On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke  wrote:

> You can change the CSV parser library
>
--
Regards,
Chanh


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-08 Thread Jörn Franke
You can change the CSV parser library 
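
For example, Commons CSV could be applied per line over a plain text read - a rough
sketch (library choice, limit and path are only examples, and commons-csv may need
to be added as a dependency):

  import java.io.StringReader
  import scala.collection.JavaConverters._
  import org.apache.commons.csv.CSVFormat

  val raw = spark.read.textFile("/path/to/input.csv")    // hypothetical path
  val rows = raw.rdd.flatMap { line =>
    // Parse one line with Commons CSV; drop it if it is too wide or cannot be parsed.
    try {
      CSVFormat.DEFAULT.parse(new StringReader(line)).getRecords.asScala
        .map(rec => (0 until rec.size).map(i => rec.get(i)))
        .filter(_.size <= 20480)                          // example column limit
    } catch { case _: Exception => Nil }
  }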

> On 8. Jun 2017, at 08:35, Chanh Le  wrote:
> 
> 
> I did add mode -> DROPMALFORMED, but it still couldn't ignore the bad rows because
> the error is raised from the CSV library that Spark is using.


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-08 Thread Chanh Le
I did add mode -> DROPMALFORMED, but it still couldn't ignore the bad rows because
the error is raised from the CSV library that Spark is using.
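
As an illustration of where the error comes from, a rough sketch that drives
univocity (which Spark already bundles) per line and simply skips anything it cannot
parse - the skip-on-error wrapping is just an example workaround, not a parser setting:

  import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
  import scala.util.control.NonFatal

  val raw = spark.read.textFile("/path/to/input.csv")    // hypothetical path
  val parsed = raw.rdd.mapPartitions { lines =>
    val settings = new CsvParserSettings()
    settings.setMaxColumns(20480)                         // the knob the error message points at
    val parser = new CsvParser(settings)
    lines.flatMap { line =>
      try Option(parser.parseLine(line)).map(_.toSeq)     // one array of fields per CSV line
      catch { case NonFatal(_) => None }                  // e.g. the TextParsingException from the original report
    }
  }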


On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke  wrote:

> The CSV data source allows you to skip invalid lines - this should also include
> lines that have more columns than maxColumns. Choose mode "DROPMALFORMED".
--
Regards,
Chanh


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-07 Thread Jörn Franke
The CSV data source allows you to skip invalid lines - this should also include
lines that have more columns than maxColumns. Choose mode "DROPMALFORMED".
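
For example, a minimal sketch (hypothetical path):

  // Ask Spark's CSV source to silently drop rows it classifies as malformed.
  val df = spark.read
    .option("mode", "DROPMALFORMED")
    .csv("/path/to/input.csv")   // hypothetical path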

> On 8. Jun 2017, at 03:04, Chanh Le  wrote:
> 
> Hi Takeshi, Jörn Franke,
> 
> The problem is that even if I increase maxColumns, there are still some lines with
> more columns than the limit I set, and a very large limit costs a lot of memory.
> So I just want to skip any line that has more columns than the maxColumns I set.
> 
> Regards,
> Chanh


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-07 Thread Chanh Le
Hi Takeshi, Jörn Franke,

The problem is that even if I increase maxColumns, there are still some lines with
more columns than the limit I set, and a very large limit costs a lot of memory.
So I just want to skip any line that has more columns than the maxColumns I set.
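
A rough sketch of that kind of pre-filtering (it only counts raw delimiters, so it
is not quote-aware, and the limit and paths are just examples):

  // Drop raw lines that would exceed the column limit before the CSV parser sees them.
  val maxCols = 20480                                    // example limit
  val raw = spark.read.textFile("/path/to/input.csv")    // Dataset[String], hypothetical path
  val narrowEnough = raw.filter(line => line.count(_ == ',') < maxCols)

  // Spark 2.1 cannot feed a Dataset[String] straight into spark.read.csv, so one option
  // is to write the filtered lines back out and read that location as CSV.
  narrowEnough.write.text("/path/to/filtered")           // hypothetical path
  val df = spark.read.option("maxColumns", maxCols.toString).csv("/path/to/filtered")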

Regards,
Chanh


On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro wrote:

> Is it not enough to set `maxColumns` in CSV options?
>
>
> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>
> // maropu
-- 
Regards,
Chanh


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-07 Thread Takeshi Yamamuro
Is it not enough to set `maxColumns` in CSV options?

https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
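
For example, a minimal sketch (the limit and path below are just examples):

  // Raise the column limit that Spark passes down to the underlying CSV parser.
  val df = spark.read
    .option("maxColumns", "100000")   // example value; the default is 20480
    .csv("/path/to/input.csv")        // hypothetical path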

// maropu

On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke  wrote:

> The Spark CSV data source should be able to handle this.
>


-- 
---
Takeshi Yamamuro


Re: [CSV] If the number of columns in one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-07 Thread Jörn Franke
The Spark CSV data source should be able to handle this.

> On 7. Jun 2017, at 17:50, Chanh Le  wrote:
> 
> Hi everyone,
> I am using Spark 2.1.1 to read CSV files and convert them to Avro files.
> One problem that I am facing is that if one row of a CSV file has more columns
> than maxColumns (default is 20480), the whole parsing process stops.
> 
> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
> com.univocity.parsers.common.TextParsingException: 
> java.lang.ArrayIndexOutOfBoundsException - 2
> Hint: Number of columns processed may have exceeded limit of 2 columns. Use 
> settings.setMaxColumns(int) to define the maximum number of columns your 
> input can have
> Ensure your configuration is correct, with delimiters, quotes and escape 
> sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
> 
> 
> I did some investigation into the univocity library, but the way it handles this
> is to throw an error, which is why Spark stops the whole process.
> 
> How can I skip the invalid row and just continue parsing the next valid one?
> Are there any libraries that could replace univocity for this job?
> 
> Thanks & regards,
> Chanh
> -- 
> Regards,
> Chanh
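
For reference, a minimal sketch of the pipeline described above - the paths are
hypothetical and the Avro step assumes the external spark-avro package of the
Spark 2.1 era:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

  // Read the CSV input; maxColumns and mode are the two options discussed in this thread.
  val df = spark.read
    .option("maxColumns", "20480")        // the default limit mentioned above
    .option("mode", "DROPMALFORMED")      // drop rows Spark recognises as malformed
    .csv("/path/to/input/*.csv")          // hypothetical input path

  // Write the result as Avro via the external spark-avro package.
  df.write.format("com.databricks.spark.avro").save("/path/to/output")  // hypothetical output path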