I am trying out Hive using Cloudera's EC2 distribution (Hadoop
0.18.3, Hive 0.4.1, I believe).
I'm trying to run the query shown further down, which causes every map
task to fail with the following NPE before making any progress:
java.lang.NullPointerException
    at org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:205)
    at org.apache.hadoop.hive.serde2.lazy.LazyStruct.getField(LazyStruct.java:182)
    at org.apache.hadoop.hive.serde2.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:141)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.evaluate(ExprNodeColumnEvaluator.java:53)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:74)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:332)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:49)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:332)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:175)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:71)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
The query:
-- Get the node's max price and corresponding year/day/hour/month
select isone.node_id, isone.day, isone.hour, isone.lmp
from (select max(lmp) as mlmp, node_id
      from isone_lmp
      where isone_lmp.node_id = 400
      group by node_id) maxlmp
join isone_lmp isone on ( isone.node_id = maxlmp.node_id
                          and isone.lmp = maxlmp.mlmp );
The table:
CREATE TABLE isone_lmp (
    node_id int,
    day string,
    hour int,
    minute int,
    energy float,
    congestion float,
    loss float,
    lmp float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
The data looks like the following:
396,20090120,00,00,62.77,0,.78,63.55
397,20090120,00,00,62.77,0,.65,63.42
398,20090120,00,00,62.77,0,.65,63.42
399,20090120,00,00,62.77,0,.65,63.42
400,20090120,00,00,62.77,0,.65,63.42
401,20090120,00,00,62.77,0,-1.02,61.75
405,20090120,00,00,62.77,0,.21,62.98
It's about 15GB of data in total (roughly 500M rows). A simple
"select count(1) from isone_lmp;" executes as expected, and I've been
able to execute the same join query on a smaller subset of the data
(2M rows) on a non-distributed setup locally. Any thoughts?
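Since the smaller subset works and the full data set does not, one thing worth ruling out is malformed input. The sketch below is only a possible check, not a known diagnosis: it assumes (unconfirmed) that blank or truncated lines in the raw text files could be involved, and the helper name find_bad_rows is made up for illustration. It reports lines whose comma-separated field count differs from the 8 columns in the table definition.

```python
EXPECTED_FIELDS = 8  # number of columns in the isone_lmp table definition


def find_bad_rows(lines, expected=EXPECTED_FIELDS, limit=10):
    """Return up to `limit` (line_number, text) pairs whose
    comma-separated field count differs from `expected`.

    Hypothetical helper: it only checks field counts, not types.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if len(text.split(",")) != expected:
            bad.append((lineno, text))
            if len(bad) >= limit:
                break
    return bad


# Small in-memory demo: one good row from the sample data above,
# one blank line, and one truncated row.
sample = [
    "400,20090120,00,00,62.77,0,.65,63.42\n",
    "\n",
    "401,20090120,00,00\n",
]
bad_rows = find_bad_rows(sample)
```

To scan a real file, the same function can be passed an open file object, for example find_bad_rows(open(path)) where path points at a local copy of one of the data files.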
Thanks.
-Tom