[ 
https://issues.apache.org/jira/browse/PHOENIX-7267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain updated PHOENIX-7267:
--------------------------------
    Labels: bulkload  (was: )

> CsvBulkLoadTool fails for a bad record with "(startline 1) EOF reached before 
> encapsulated token finished"
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-7267
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7267
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 5.2.0, 5.1.3, 5.3.0
>            Reporter: Nihal Jain
>            Assignee: Nihal Jain
>            Priority: Major
>              Labels: bulkload
>
> We are trying to load data where some files contain a few bad records. These 
> cause the mappers to fail, and hence the entire job fails with the following 
> error:
> {code:java}
> Error: java.lang.RuntimeException: java.lang.RuntimeException: 
> java.io.IOException: (startline 1) EOF reached before encapsulated token 
> finished
>       at 
> org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:206)
>       at 
> org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:77)
>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> Caused by: java.lang.RuntimeException: java.io.IOException: (startline 1) EOF 
> reached before encapsulated token finished
>       at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398)
>       at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407)
>       at 
> org.apache.phoenix.thirdparty.com.google.common.collect.Iterators.getNext(Iterators.java:895)
>       at 
> org.apache.phoenix.thirdparty.com.google.common.collect.Iterables.getFirst(Iterables.java:827)
>       at 
> org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:109)
>       at 
> org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:91)
>       at 
> org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:164)
>       ... 9 more
> Caused by: java.io.IOException: (startline 1) EOF reached before encapsulated 
> token finished
>       at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
>       at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
>       at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450)
>       at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:395)
>       ... 15 more {code}
> I have figured out that commons-csv throws a RuntimeException when it fails 
> to parse a record. This exception is not handled by Phoenix, as we only catch 
> IOException.
> See 
> [https://github.com/apache/commons-csv/blob/rel/commons-csv-1.0/src/main/java/org/apache/commons/csv/CSVParser.java#L398]
>  
> Also see 
> [https://github.com/apache/phoenix/blob/master/phoenix-core-server/src/main/java/org/apache/phoenix/mapreduce/FormatToBytesWritableMapper.java#L167]
>  
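The failure mode can be sketched in plain Java. The class and method names below are hypothetical stand-ins for the Phoenix and commons-csv code linked above, not the actual implementation:

```java
import java.io.IOException;

// Hypothetical stand-ins illustrating the bug: commons-csv 1.0 rethrows the
// lexer's checked IOException as an unchecked RuntimeException, so a caller
// that only catches IOException never sees it.
class CsvFailureSketch {

    // Mimics CsvLineParser.parse(): it declares IOException, but the record
    // iterator inside commons-csv actually surfaces the parse error wrapped
    // in a RuntimeException.
    static String parse(String line) throws IOException {
        throw new RuntimeException(new IOException(
            "(startline 1) EOF reached before encapsulated token finished"));
    }

    // Mimics FormatToBytesWritableMapper.map(): only IOException is caught,
    // so the RuntimeException escapes and kills the mapper task.
    static String mapRecord(String line) {
        try {
            return parse(line);
        } catch (IOException e) {
            // Never reached for this failure: the wrapper is unchecked.
            return "skipped: " + e.getMessage();
        }
    }
}
```

Calling `mapRecord` on a bad record propagates the RuntimeException to the framework, which is exactly the stack trace shown above.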
> This is undesirable: in the worst case the job should just skip the failed 
> record rather than failing entirely. Note we are passing --ignore-errors.
> This bug is to fix this behavior and figure out a way to handle failed 
> records so the job can continue. We should also bump commons-csv to 1.10.0; 
> it has been quite a while since we last upgraded it, so this is a good 
> opportunity to move up here as well.
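One possible shape for the fix (a hedged sketch only, with hypothetical names, not the actual Phoenix change): treat a RuntimeException from the parser like a parse error, unwrap its cause, and when ignore-errors is set, skip just that record.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of skip-on-error behavior for bad CSV records.
class SkipBadRecordsSketch {

    // Hypothetical parser: an unbalanced quote stands in for the malformed
    // encapsulated token, thrown wrapped the same way commons-csv 1.0 does.
    static String parse(String line) {
        if (line.chars().filter(c -> c == '"').count() % 2 != 0) {
            throw new RuntimeException(new IOException(
                "EOF reached before encapsulated token finished"));
        }
        return line.toUpperCase();
    }

    static List<String> mapAll(List<String> lines, boolean ignoreErrors) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            try {
                out.add(parse(line));
            } catch (RuntimeException e) {
                // Unwrap the unchecked wrapper around the lexer's IOException.
                if (ignoreErrors && e.getCause() instanceof IOException) {
                    continue; // skip only this record, not the whole job
                }
                throw e; // without --ignore-errors, fail fast as before
            }
        }
        return out;
    }
}
```

With ignore-errors enabled, the bad record is dropped and the remaining records are still loaded; without it, the job fails on the first bad record as it does today.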



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
