[jira] [Commented] (FLINK-5907) RowCsvInputFormat bug on parsing tsv

ASF GitHub Bot (JIRA) Mon, 27 Feb 2017 05:25:47 -0800

    [ https://issues.apache.org/jira/browse/FLINK-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885778#comment-15885778 ]


ASF GitHub Bot commented on FLINK-5907:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3417#discussion_r103203173
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -358,24 +358,27 @@ protected boolean parseRecord(Object[] holders, byte[] bytes, int offset, int nu
                for (int field = 0, output = 0; field < fieldIncluded.length; field++) {
                        
                        // check valid start position
    -                   if (startPos >= limit) {
    +                   if (startPos > limit || (startPos == limit && field != fieldIncluded.length - 1)) {
                                if (lenient) {
                                        return false;
                                } else {
                                        throw new ParseException("Row too short: " + new String(bytes, offset, numBytes));
                                }
                        }
    -                   
    +
                        if (fieldIncluded[field]) {
                                // parse field
                                @SuppressWarnings("unchecked")
                                FieldParser<Object> parser = (FieldParser<Object>) this.fieldParsers[output];
                                Object reuse = holders[output];
                                startPos = parser.resetErrorStateAndParse(bytes, startPos, limit, this.fieldDelim, reuse);
                                holders[output] = parser.getLastResult();
    -                           
    +
                                // check parse result
    -                           if (startPos < 0) {
    +                           if (startPos < 0 ||
    +                                           (startPos == limit
    --- End diff --
    
    Move this condition into an `else if` branch and give a more detailed error message (row too short).
    Also add a comment that we read the whole record but that there are fields missing.
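
One way to read this suggestion, as a minimal self-contained sketch and not the actual Flink patch: the parse-error check and the "row too short" check get separate branches, each with its own message. The class `RowParseSketch`, the method `parseRecord`, its nested `ParseException`, and the rule that a field containing a double quote counts as malformed are all hypothetical stand-ins chosen for illustration; the real code works on byte arrays and `FieldParser` instances.

```java
import java.util.Arrays;

class RowParseSketch {

    /** Hypothetical stand-in for the exception type used in the diff. */
    static class ParseException extends RuntimeException {
        ParseException(String message) {
            super(message);
        }
    }

    /**
     * Splits one record into exactly expectedFields string fields.
     * In lenient mode a bad row yields null instead of an exception.
     */
    static String[] parseRecord(String record, char delim, int expectedFields, boolean lenient) {
        String[] out = new String[expectedFields];
        int startPos = 0;
        int limit = record.length();

        for (int field = 0; field < expectedFields; field++) {
            int end = record.indexOf(delim, startPos);
            boolean recordExhausted = end < 0; // no delimiter left after this field
            boolean lastField = field == expectedFields - 1;
            String value = record.substring(startPos, recordExhausted ? limit : end);

            if (value.indexOf('"') >= 0) {
                // Assumed parse-error rule, for illustration only.
                if (lenient) {
                    return null;
                }
                throw new ParseException("Parsing error for column " + (field + 1)
                        + " of row '" + record + "'");
            } else if (recordExhausted && !lastField) {
                // The whole record was read, but fields are still missing.
                if (lenient) {
                    return null;
                }
                throw new ParseException("Row too short: expected " + expectedFields
                        + " fields but found " + (field + 1) + " in '" + record + "'");
            }

            out[field] = value;
            startPos = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // A trailing delimiter means the last field is empty, not missing.
        System.out.println(Arrays.toString(parseRecord("a\tb\t", '\t', 3, false)));
        // One field short: lenient mode skips the row instead of throwing.
        System.out.println(Arrays.toString(parseRecord("a\tb", '\t', 3, true)));
    }
}
```

Keeping the two branches apart lets the error message say whether a field was malformed or the record simply ended early, which is the distinction the review comment asks for.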


> RowCsvInputFormat bug on parsing tsv
> ------------------------------------
>
>                 Key: FLINK-5907
>                 URL: https://issues.apache.org/jira/browse/FLINK-5907
>             Project: Flink
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 1.2.0
>            Reporter: Flavio Pompermaier
>            Assignee: Kurt Young
>              Labels: csv, parsing
>         Attachments: test.tsv
>
>
> The following snippet reproduces the problem (using the attached file as input):
> {code:language=java}
> char fieldDelim = '\t';
> TypeInformation<?>[] fieldTypes = new TypeInformation<?>[51];
> for (int i = 0; i < fieldTypes.length; i++) {
>   fieldTypes[i] = BasicTypeInfo.STRING_TYPE_INFO;
> }
> int[] fieldMask = new int[fieldTypes.length];
> for (int i = 0; i < fieldMask.length; i++) {
>   fieldMask[i] = i;
> }
> RowCsvInputFormat csvIF = new RowCsvInputFormat(new Path(testCsv), fieldTypes, "\n", fieldDelim + "", fieldMask, true);
> csvIF.setNestedFileEnumeration(true);
> DataSet<Row> csv = env.createInput(csvIF);
> csv.print();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
