Nihal Jain created PHOENIX-7267:
-----------------------------------

             Summary: CsvBulkLoadTool fails for a bad record with "(startline 1) EOF reached before encapsulated token finished"
                 Key: PHOENIX-7267
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7267
             Project: Phoenix
          Issue Type: Bug
    Affects Versions: 5.1.3, 5.2.0, 5.3.0
            Reporter: Nihal Jain
            Assignee: Nihal Jain


We are trying to load data in which some files contain a few bad records. These 
cause the mappers to fail, and hence the entire job fails, with the following error:
{code:java}
Error: java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: (startline 1) EOF reached before encapsulated token finished
        at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:206)
        at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:77)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.lang.RuntimeException: java.io.IOException: (startline 1) EOF reached before encapsulated token finished
        at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398)
        at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407)
        at org.apache.phoenix.thirdparty.com.google.common.collect.Iterators.getNext(Iterators.java:895)
        at org.apache.phoenix.thirdparty.com.google.common.collect.Iterables.getFirst(Iterables.java:827)
        at org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:109)
        at org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:91)
        at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:164)
        ... 9 more
Caused by: java.io.IOException: (startline 1) EOF reached before encapsulated token finished
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450)
        at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:395)
        ... 15 more
{code}
I have found that commons-csv throws a RuntimeException (wrapping the 
underlying IOException) when it fails to parse a record. Phoenix does not handle 
this, because we only catch IOException.
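
The mismatch can be reproduced without commons-csv. The following is a stand-alone sketch, not Phoenix or commons-csv code: failingIterator() is a hypothetical stand-in for the record iterator, which (as the stack trace shows) surfaces the lexer's IOException wrapped in a RuntimeException, and handleIoOnly()/handleWrapped() are hypothetical stand-ins for the two ways a caller might handle it.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

class WrappedIoExceptionDemo {

    // Hypothetical stand-in for the CSV record iterator. Iterator.hasNext()
    // cannot declare IOException, so the failure is rethrown unchecked.
    static Iterator<String> failingIterator() {
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                IOException cause = new IOException(
                        "(startline 1) EOF reached before encapsulated token finished");
                throw new RuntimeException(cause);
            }

            @Override
            public String next() {
                throw new NoSuchElementException();
            }
        };
    }

    // Hypothetical parse step: declares IOException, but the iterator
    // never throws it directly.
    static String firstRecord(Iterator<String> records) throws IOException {
        return records.hasNext() ? records.next() : null;
    }

    // Catches only IOException, so the wrapped failure escapes.
    static String handleIoOnly(Iterator<String> records) {
        try {
            return "parsed: " + firstRecord(records);
        } catch (IOException e) {
            return "handled: " + e.getMessage();
        }
    }

    // Also handles a RuntimeException whose cause is an IOException.
    static String handleWrapped(Iterator<String> records) {
        try {
            return "parsed: " + firstRecord(records);
        } catch (IOException e) {
            return "handled: " + e.getMessage();
        } catch (RuntimeException e) {
            if (e.getCause() instanceof IOException) {
                return "handled: " + e.getCause().getMessage();
            }
            throw e; // unrelated runtime failures still propagate
        }
    }

    public static void main(String[] args) {
        System.out.println(handleWrapped(failingIterator()));
        try {
            handleIoOnly(failingIterator());
        } catch (RuntimeException e) {
            System.out.println("escaped: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Checking the cause before swallowing the exception is one possible design choice; it keeps unrelated runtime failures visible instead of treating every RuntimeException as a bad record.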

See 
[https://github.com/apache/commons-csv/blob/rel/commons-csv-1.0/src/main/java/org/apache/commons/csv/CSVParser.java#L398]
 

Also see 
[https://github.com/apache/phoenix/blob/master/phoenix-core-server/src/main/java/org/apache/phoenix/mapreduce/FormatToBytesWritableMapper.java#L167]

 

This is undesirable: in the worst case, the job should skip just the failed 
record rather than fail entirely. Note that we are passing --ignore-errors.
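
The expected behavior could look roughly like the sketch below. It is only an illustration under assumptions: parseLine(), load(), Result, and the ignoreErrors parameter are hypothetical names, and the quote-counting parser is a deliberately naive substitute for real CSV parsing. It does not reflect how Phoenix implements --ignore-errors.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipBadRecordsDemo {

    // Outcome of a load: the parsed records plus a count of skipped ones.
    static final class Result {
        final List<String[]> loaded = new ArrayList<>();
        int skipped;
    }

    // Hypothetical, naive parser: rejects a line with an unterminated
    // quoted field, wrapping the IOException as the real failure does.
    static String[] parseLine(String line) {
        long quotes = line.chars().filter(c -> c == '"').count();
        if (quotes % 2 != 0) {
            throw new RuntimeException(
                    new IOException("EOF reached before encapsulated token finished"));
        }
        return line.split(",");
    }

    // With ignoreErrors set, a bad record is counted and skipped;
    // otherwise the first bad record fails the whole load.
    static Result load(List<String> lines, boolean ignoreErrors) {
        Result result = new Result();
        for (String line : lines) {
            try {
                result.loaded.add(parseLine(line));
            } catch (RuntimeException e) {
                if (!ignoreErrors) {
                    throw e;
                }
                result.skipped++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,alice", "2,\"bob", "3,carol");
        Result result = load(lines, true);
        System.out.println("loaded=" + result.loaded.size() + " skipped=" + result.skipped);
    }
}
```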

This bug is to fix this behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
