Yubao Liu created PIG-3828:
------------------------------

             Summary: PigStorage should properly handle \r, \n, \r\n in record itself
                 Key: PIG-3828
                 URL: https://issues.apache.org/jira/browse/PIG-3828
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.11.1, 0.12.0
         Environment: Linux
            Reporter: Yubao Liu


Currently PigStorage uses
org.apache.hadoop.mapreduce.lib.input.LineRecordReader to read lines.
LineRecordReader treats "\r", "\n", and "\r\n" all as end of line, so if some
record unluckily contains one of these special strings, PigStorage reads fewer
fields than expected. This produces an IndexOutOfBoundsException when the
"-schema" option is given under Pig <= 0.12.0:

https://svn.apache.org/repos/asf/pig/tags/release-0.12.0/src/org/apache/pig/builtin/PigStorage.java

{quote}
private Tuple applySchema(Tuple tup) throws IOException {
    ....
    for (int i = 0; i < Math.min(fieldSchemas.length, tup.size()); i++) {
        if (mRequiredColumns == null || (mRequiredColumns.length>i && mRequiredColumns[i])) {
            Object val = null;
            if(tup.get(tupleIdx) != null){    <--- !!! IndexOutOfBoundsException
{quote}
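
The splitting is easy to reproduce outside Hadoop. Below is a minimal sketch
(toy data, not taken from an actual failure) using java.io.BufferedReader,
whose readLine() has the same end-of-line rules as LineRecordReader: a single
logical record containing a bare \r comes back as two short lines.

{quote}
import java.io.BufferedReader;
import java.io.StringReader;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        // One logical record with 4 tab-separated fields, but a stray \r inside it.
        String record = "a\tb\rc\td";
        BufferedReader reader = new BufferedReader(new StringReader(record));
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine(), like LineRecordReader, treats \r, \n, \r\n as end of line.
            String[] fields = line.split("\t");
            System.out.println(fields.length + " fields in line: " + line);
            // prints "2 fields in line: ..." twice -- never the expected 4
        }
    }
}
{quote}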

In Pig trunk, null values are silently filled in:

https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/builtin/PigStorage.java

{quote}
    for (int i = 0; i < fieldSchemas.length; i++) {
        if (mRequiredColumns == null || (mRequiredColumns.length>i && mRequiredColumns[i])) {
            if (tupleIdx >= tup.size()) {
                tup.append(null);    <--- !!! silently fills in null
            }

            Object val = null;
            if(tup.get(tupleIdx) != null){
{quote}

The behaviour of Pig trunk is still error-prone (see the sketch after this list):
* null is silently filled in for the current record; the user's Pig script may
not expect that field to be null and thus hits a NullPointerException.
* the next record is totally garbled because it starts from the middle of the
previous record; the data types of its fields are all wrong, which probably
breaks the user's Pig script.
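
Both failure modes show up in a toy simulation (hypothetical data and schema,
not Pig internals): under an assumed schema (id:int, name:chararray), a stray
\r inside the name field makes the first half-record silently lose data, while
the second half-record puts text into the int column and is padded with null.

{quote}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullFillDemo {
    public static void main(String[] args) {
        // One logical record "1<TAB>Al\rice" under the assumed schema (id:int, name:chararray).
        String record = "1\tAl\rice";
        for (String line : record.split("\r")) {   // what LineRecordReader effectively does
            List<String> fields = new ArrayList<>(Arrays.asList(line.split("\t")));
            while (fields.size() < 2) {
                fields.add(null);                  // simulates trunk's tup.append(null)
            }
            System.out.println(fields);
        }
        // prints [1, Al]      -- name is silently truncated
        // prints [ice, null]  -- "ice" lands in the int column, null fills the rest
    }
}
{quote}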

Until PigStorage supports a customized record separator, the following may be a
not-so-bad workaround for this nasty issue. Usually there is only a very small
chance that the first record contains a record separator, so PigStorage can
save maxFieldsNumber in PigStorage.getNext(); if PigStorage.getNext() then
parses a record and finds it has fewer fields, it simply drops the current
record. The second half of the split record will be dropped too, because it
must also have fewer fields. In this way PigStorage can drop bad records on a
best-effort basis.
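
A standalone simulation of that idea might look like the sketch below
(hypothetical code, not a patch against PigStorage; the input lines and the
3-field layout are made up). maxFieldsNumber is established by the first
complete record, and every shorter record, including both halves of a split
one, is dropped instead of being padded with nulls.

{quote}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DropShortRecordsDemo {
    public static void main(String[] args) {
        // "2<TAB>Bob<TAB>25" was split in two by a stray \r after "Bo".
        String[] lines = { "1\tAlice\t30", "2\tBo", "b\t25", "3\tCarol\t41" };
        int maxFieldsNumber = 0;                    // the counter getNext() would keep
        List<String[]> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            maxFieldsNumber = Math.max(maxFieldsNumber, fields.length);
            if (fields.length < maxFieldsNumber) {
                continue;                           // drop both halves of the split record
            }
            kept.add(fields);
        }
        for (String[] t : kept) {
            System.out.println(Arrays.toString(t)); // [1, Alice, 30] and [3, Carol, 41]
        }
    }
}
{quote}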




--
This message was sent by Atlassian JIRA
(v6.2#6252)
