Yubao Liu created PIG-3828:
------------------------------
Summary: PigStorage should properly handle \r, \n, \r\n in record
itself
Key: PIG-3828
URL: https://issues.apache.org/jira/browse/PIG-3828
Project: Pig
Issue Type: Bug
Affects Versions: 0.11.1, 0.12.0
Environment: Linux
Reporter: Yubao Liu
Currently PigStorage uses
org.apache.hadoop.mapreduce.lib.input.LineRecordReader to read lines,
LineRecordReader thinks "\r", "\n", "\r\n" all are end of line, if unluckily
some record contains these special strings, PigStorage will read less fields,
this produces IndexOutOfBoundsException when "-schema" option is given under
PIG <= 0.12.0:
https://svn.apache.org/repos/asf/pig/tags/release-0.12.0/src/org/apache/pig/builtin/PigStorage.java
{quote}
private Tuple applySchema(Tuple tup) throws IOException {
....
for (int i = 0; i < Math.min(fieldSchemas.length, tup.size()); i++) {
if (mRequiredColumns == null || (mRequiredColumns.length>i &&
mRequiredColumns[i])) {
Object val = null;
if(tup.get(tupleIdx) != null){ <--- !!!
IndexOutOfBoundsException
{quote}
In PIG-trunk, null values are silently filled:
{quote}
https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/builtin/PigStorage.java
for (int i = 0; i < fieldSchemas.length; i++) {
if (mRequiredColumns == null || (mRequiredColumns.length>i &&
mRequiredColumns[i])) {
if (tupleIdx >= tup.size()) {
tup.append(null); <--- !!! silently
fill null
}
Object val = null;
if(tup.get(tupleIdx) != null){
{quote}
The behaviour of PIG-trunk is still error-prone:
* null is silently filled for current record, the user's PIG script may not
realize that field can be null thus get NullPointerException
* the next record is totally garbled because it starts from the middle of
previous record, the data types of each fields in this record are totally
wrong, so this probably breaks user's PIG script.
Before PigStorage supports customized record separator, this may be a
not-so-bad workaround for this nasty issue: usually there is very small chance
the first record containing record separator, PigStorage can save
maxFieldsNumber in PigStorage.getNext(), if PigStorage.getNext() parses a
record and find it has less fields, it just throws current record, the next
half of current record will be thrown too because it must also have less
fields. By this way, PigStorage can throw bad records at best effort.
--
This message was sent by Atlassian JIRA
(v6.2#6252)