[
https://issues.apache.org/jira/browse/PIG-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950179#comment-13950179
]
Daniel Dai commented on PIG-3828:
---------------------------------
This is a duplicate of PIG-836. With MAPREDUCE-2254, you should be able to set
textinputformat.record.delimiter if you use Hadoop 0.23+. Are you able to try
it?
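If the data contains some byte that never appears inside a record, the property can be set straight from the Pig script. A minimal sketch, assuming Pig forwards the property to the job configuration as it does for other Hadoop settings; the '\u0001' delimiter is only a hypothetical example, not something from this issue:

```
-- Requires Hadoop 0.23+ (MAPREDUCE-2254).
SET textinputformat.record.delimiter '\u0001';
raw = LOAD 'input' USING PigStorage('\t');
```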
> PigStorage should properly handle \r, \n, \r\n in record itself
> ---------------------------------------------------------------
>
> Key: PIG-3828
> URL: https://issues.apache.org/jira/browse/PIG-3828
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.12.0, 0.11.1
> Environment: Linux
> Reporter: Yubao Liu
>
> Currently PigStorage uses
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader to read lines.
> LineRecordReader treats "\r", "\n", and "\r\n" all as end of line, so if a
> record unluckily contains one of these strings, PigStorage reads fewer
> fields than expected. This produces an IndexOutOfBoundsException when the
> "-schema" option is given under Pig <= 0.12.0:
> https://svn.apache.org/repos/asf/pig/tags/release-0.12.0/src/org/apache/pig/builtin/PigStorage.java
> {quote}
> private Tuple applySchema(Tuple tup) throws IOException {
>     ....
>     for (int i = 0; i < Math.min(fieldSchemas.length, tup.size()); i++) {
>         if (mRequiredColumns == null || (mRequiredColumns.length>i && mRequiredColumns[i])) {
>             Object val = null;
>             if(tup.get(tupleIdx) != null){   <--- !!! IndexOutOfBoundsException
> {quote}
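The mis-splitting itself is easy to reproduce outside of Pig: java.io.BufferedReader.readLine() uses the same line-ending rules as LineRecordReader, so a record with an embedded "\r" comes back as two short records. A self-contained illustration using only the plain JDK (no Pig or Hadoop classes involved):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    public static void main(String[] args) throws IOException {
        // One logical record of 3 tab-separated fields: "a", "b1\rb2", "c".
        // The embedded "\r" in the second field is treated as an end of
        // line, just as LineRecordReader would treat it.
        String data = "a\tb1\rb2\tc\n";
        BufferedReader in = new BufferedReader(new StringReader(data));
        List<String[]> records = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            records.add(line.split("\t"));
        }
        // The single record became two, and each fragment has only 2
        // fields instead of 3 -- the shortfall that triggers the
        // IndexOutOfBoundsException above.
        System.out.println(records.size());        // 2
        System.out.println(records.get(0).length); // 2
    }
}
```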
> In PIG-trunk, null values are silently filled:
> {quote}
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/builtin/PigStorage.java
> for (int i = 0; i < fieldSchemas.length; i++) {
>     if (mRequiredColumns == null || (mRequiredColumns.length>i && mRequiredColumns[i])) {
>         if (tupleIdx >= tup.size()) {
>             tup.append(null);   <--- !!! silently fill null
>         }
>
>         Object val = null;
>         if(tup.get(tupleIdx) != null){
> {quote}
> The behaviour of PIG-trunk is still error-prone:
> * null is silently filled in for the current record; the user's Pig script
> may not anticipate that the field can be null and may hit a
> NullPointerException
> * the next record is completely garbled because it starts from the middle
> of the previous record, so the data types of its fields are all wrong,
> which will probably break the user's Pig script.
> Until PigStorage supports a customized record separator, this may be a
> not-so-bad workaround for this nasty issue: since there is usually only a
> small chance that the first record contains the record separator,
> PigStorage can save maxFieldsNumber in PigStorage.getNext(); if
> PigStorage.getNext() parses a record and finds it has fewer fields, it
> simply drops the current record. The second half of that record will be
> dropped too, because it must also have fewer fields. In this way,
> PigStorage can drop bad records on a best-effort basis.
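The proposed heuristic can be sketched as follows (the names BestEffortFilter, maxFields, and accept are hypothetical, not part of the PigStorage API): the widest record seen so far sets the expected field count, and anything narrower is dropped, which discards both halves of a record broken by a stray "\r" or "\n".

```java
import java.util.ArrayList;
import java.util.List;

public class BestEffortFilter {
    private int maxFields = 0;

    // Remember the widest record seen so far; reject any record that is
    // narrower, on the assumption that both halves of a record broken by
    // an embedded line terminator will come up short.
    public boolean accept(String[] fields) {
        if (fields.length > maxFields) {
            maxFields = fields.length;
        }
        return fields.length == maxFields;
    }

    public static void main(String[] args) {
        BestEffortFilter f = new BestEffortFilter();
        List<String[]> kept = new ArrayList<>();
        String[][] input = {
            {"a", "b", "c"}, // first record establishes maxFields = 3
            {"d", "e"},      // first half of a broken record: dropped
            {"f", "g"},      // second half of it: dropped too
            {"h", "i", "j"}, // intact record: kept
        };
        for (String[] rec : input) {
            if (f.accept(rec)) kept.add(rec);
        }
        System.out.println(kept.size()); // 2
    }
}
```

As the description above notes, this only works when the first record parsed is itself intact; a separator inside the very first record would poison maxFields.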
--
This message was sent by Atlassian JIRA
(v6.2#6252)