[
https://issues.apache.org/jira/browse/PIG-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950179#comment-13950179
]
Daniel Dai commented on PIG-3828:
---------------------------------
This is a duplicate of PIG-836. With MAPREDUCE-2254, you should be able to set
textinputformat.record.delimiter if you use Hadoop 0.23+. Are you able to try
it?
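If the data contains some byte that never appears inside a record, the property can be set straight from the Pig script. A minimal sketch, assuming Pig forwards the property to the job configuration as it does for other Hadoop settings; the '\u0001' delimiter is only a hypothetical example, not something from this issue:

```
-- Requires Hadoop 0.23+ (MAPREDUCE-2254).
SET textinputformat.record.delimiter '\u0001';
raw = LOAD 'input' USING PigStorage('\t');
```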
> PigStorage should properly handle \r, \n, \r\n in record itself
> ---------------------------------------------------------------
>
> Key: PIG-3828
> URL: https://issues.apache.org/jira/browse/PIG-3828
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.12.0, 0.11.1
> Environment: Linux
> Reporter: Yubao Liu
>
> Currently PigStorage uses
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader to read lines.
> LineRecordReader treats "\r", "\n", and "\r\n" all as end of line, so if a
> record unluckily contains one of these strings, PigStorage reads fewer
> fields than expected. This produces an IndexOutOfBoundsException when the
> "-schema" option is given under Pig <= 0.12.0:
> https://svn.apache.org/repos/asf/pig/tags/release-0.12.0/src/org/apache/pig/builtin/PigStorage.java
> {quote}
> private Tuple applySchema(Tuple tup) throws IOException {
>     ....
>     for (int i = 0; i < Math.min(fieldSchemas.length, tup.size()); i++) {
>         if (mRequiredColumns == null || (mRequiredColumns.length>i && mRequiredColumns[i])) {
>             Object val = null;
>             if(tup.get(tupleIdx) != null){   <--- !!! IndexOutOfBoundsException
> {quote}
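The mis-splitting itself is easy to reproduce outside of Pig: java.io.BufferedReader.readLine() uses the same line-ending rules as LineRecordReader, so a record with an embedded "\r" comes back as two short records. A self-contained illustration using only the plain JDK (no Pig or Hadoop classes involved):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    public static void main(String[] args) throws IOException {
        // One logical record of 3 tab-separated fields: "a", "b1\rb2", "c".
        // The embedded "\r" in the second field is treated as an end of
        // line, just as LineRecordReader would treat it.
        String data = "a\tb1\rb2\tc\n";
        BufferedReader in = new BufferedReader(new StringReader(data));
        List<String[]> records = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            records.add(line.split("\t"));
        }
        // The single record became two, and each fragment has only 2
        // fields instead of 3 -- the shortfall that triggers the
        // IndexOutOfBoundsException above.
        System.out.println(records.size());        // 2
        System.out.println(records.get(0).length); // 2
    }
}
```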
> In PIG-trunk, null values are silently filled:
> {quote}
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/builtin/PigStorage.java
> for (int i = 0; i < fieldSchemas.length; i++) {
>     if (mRequiredColumns == null || (mRequiredColumns.length>i && mRequiredColumns[i])) {
>         if (tupleIdx >= tup.size()) {
>             tup.append(null);   <--- !!! silently fill null
>         }
>
>         Object val = null;
>         if(tup.get(tupleIdx) != null){
> {quote}
> The behaviour of PIG-trunk is still error-prone:
> * null is silently filled in for the current record; the user's Pig script
> may not anticipate that the field can be null and may hit a
> NullPointerException
> * the next record is completely garbled because it starts from the middle
> of the previous record, so the data types of its fields are all wrong,
> which will probably break the user's Pig script.
> Until PigStorage supports a customized record separator, this may be a
> not-so-bad workaround for this nasty issue: since there is usually only a
> small chance that the first record contains the record separator,
> PigStorage can save maxFieldsNumber in PigStorage.getNext(); if
> PigStorage.getNext() parses a record and finds it has fewer fields, it
> simply drops the current record. The second half of that record will be
> dropped too, because it must also have fewer fields. In this way,
> PigStorage can drop bad records on a best-effort basis.
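The proposed heuristic can be sketched as follows (the names BestEffortFilter, maxFields, and accept are hypothetical, not part of the PigStorage API): the widest record seen so far sets the expected field count, and anything narrower is dropped, which discards both halves of a record broken by a stray "\r" or "\n".

```java
import java.util.ArrayList;
import java.util.List;

public class BestEffortFilter {
    private int maxFields = 0;

    // Remember the widest record seen so far; reject any record that is
    // narrower, on the assumption that both halves of a record broken by
    // an embedded line terminator will come up short.
    public boolean accept(String[] fields) {
        if (fields.length > maxFields) {
            maxFields = fields.length;
        }
        return fields.length == maxFields;
    }

    public static void main(String[] args) {
        BestEffortFilter f = new BestEffortFilter();
        List<String[]> kept = new ArrayList<>();
        String[][] input = {
            {"a", "b", "c"}, // first record establishes maxFields = 3
            {"d", "e"},      // first half of a broken record: dropped
            {"f", "g"},      // second half of it: dropped too
            {"h", "i", "j"}, // intact record: kept
        };
        for (String[] rec : input) {
            if (f.accept(rec)) kept.add(rec);
        }
        System.out.println(kept.size()); // 2
    }
}
```

As the description above notes, this only works when the first record parsed is itself intact; a separator inside the very first record would poison maxFields.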
--
This message was sent by Atlassian JIRA
(v6.2#6252)