I often run into errors with chararray bag items. It seems that a bag
item can be cast to some type other than the specified chararray
type, perhaps just before it becomes a true chararray value, and this
can produce strange errors.
I suppose there is an attempt to detect the bag item type somewhere in
the deserializer, right? If so, why is the user-specified type not used
directly? And which characters must a string avoid so that it is not
cast to another type?
The latest issue with bags:
a = load 'a' as (word: chararray, length: long, phrases: bag{t: tuple(id: chararray)});
b = order a by word;
store b into 'b';
It gives lots of errors like:
2009-01-10 20:01:49,507 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) task_200901101933_0007_m_000077java.lang.RuntimeException: Unexpected data type 116 found in stream.
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:117)
    at org.apache.pig.builtin.BinStorage.getNext(BinStorage.java:90)
    at org.apache.pig.impl.builtin.RandomSampleLoader.getNext(RandomSampleLoader.java:44)
    at org.apache.pig.backend.executionengine.PigSlice.next(PigSlice.java:101)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper$1.next(SliceWrapper.java:157)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper$1.next(SliceWrapper.java:133)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
The number in the "Unexpected data type 116 found in stream" message varies.
2009-01-10 20:01:49,507 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) task_200901101933_0007_m_000078java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:105)
    at org.apache.pig.builtin.BinStorage.getNext(BinStorage.java:90)
    at org.apache.pig.impl.builtin.RandomSampleLoader.getNext(RandomSampleLoader.java:44)
    at org.apache.pig.backend.executionengine.PigSlice.next(PigSlice.java:101)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper$1.next(SliceWrapper.java:157)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper$1.next(SliceWrapper.java:133)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
The data files are not very large, and the mapred.child.java.opts option
is set to -Xmx2048m.
If the 'phrases' column is filtered out before ordering, everything is OK.
What is wrong with my use of bags?
Thanks.