Hi all, I am trying to sample the key distribution before doing a total-order sort, but the program fails with an exception. This is the stack trace:
    Exception in thread "main" java.lang.NullPointerException
            at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
            at org.apache.hadoop.mapreduce.lib.partition.InputSampler$RandomSampler.getSample(InputSampler.java:220)
            at org.apache.hadoop.mapreduce.lib.partition.InputSampler.writePartitionFile(InputSampler.java:315)
            at Sorter.run(Sorter.java:100)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
            at Sorter.main(Sorter.java:114)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:192)

I checked the code in LineRecordReader.java and found that the exception comes from this line:

    newSize = in.readLine(value, maxLineLength, Math.max(maxBytesToConsume(pos), maxLineLength));

Here `in` is null. Since `in` appears to be assigned only in LineRecordReader.initialize(), I suspect initialize() is never called on the record reader before nextKeyValue(). I specified the input format as TextInputFormat, so it looks like TextInputFormat fails to read the data. Any ideas on how to fix this? Thanks.

I am running Hadoop 0.21.0, and my job setup is:

    ......
    job.setInputFormatClass(TextInputFormat.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    InputSampler.Sampler<LongWritable, Text> sampler =
        new InputSampler.RandomSampler<LongWritable, Text>(0.1, 10000, 10);
    Path input = FileInputFormat.getInputPaths(job)[0];
    input = input.makeQualified(input.getFileSystem(conf));
    Path partitionFile = new Path(input, "_partitions");
    TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
    InputSampler.writePartitionFile(job, sampler);
    URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
    DistributedCache.addCacheFile(partitionUri, conf);
    DistributedCache.createSymlink(conf);
    ......
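For completeness, here is the whole driver as a self-contained sketch of what I am doing. The job name, reducer count, and output path handling are placeholders, and in this sketch I take `conf` from the Job itself so the sampler and the partitioner see the same configuration; otherwise it matches the snippet above:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Sorter extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            Job job = new Job(getConf(), "total order sort");  // placeholder name
            job.setJarByClass(Sorter.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setPartitionerClass(TotalOrderPartitioner.class);
            job.setNumReduceTasks(4);  // placeholder reducer count

            // Take the configuration from the Job itself, so the partition
            // file setting and the distributed-cache entries end up in the
            // same Configuration that the job actually runs with.
            Configuration conf = job.getConfiguration();

            InputSampler.Sampler<LongWritable, Text> sampler =
                new InputSampler.RandomSampler<LongWritable, Text>(0.1, 10000, 10);

            Path input = FileInputFormat.getInputPaths(job)[0];
            input = input.makeQualified(input.getFileSystem(conf));
            Path partitionFile = new Path(input, "_partitions");
            TotalOrderPartitioner.setPartitionFile(conf, partitionFile);

            // Sample the input and write the partition boundaries; this is
            // where the NullPointerException is thrown.
            InputSampler.writePartitionFile(job, sampler);

            // Ship the partition file to the tasks via the distributed cache.
            URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
            DistributedCache.addCacheFile(partitionUri, conf);
            DistributedCache.createSymlink(conf);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new Sorter(), args));
        }
    }

One thing I am unsure about: since new Job(conf) copies the Configuration, changes made to the original conf afterwards (like setPartitionFile) are not reflected in the job, which is why the sketch uses job.getConfiguration(). I don't know whether that is related to the NPE, though, since the failure happens during sampling itself.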